Tree of Tags

13 posts Audio Adversarial Examples AXRP

18 posts Redwood Research Adversarial Training AI Robustness

50 AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilan

1y

10

47 AXRP Episode 10 - AI’s Future and Impacts with Katja Grace

DanielFilan

1y

2

46 AXRP Episode 7 - Side Effects with Victoria Krakovna

DanielFilan

1y

6

41 AXRP Episode 12 - AI Existential Risk with Paul Christiano

DanielFilan

1y

0

37 AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch

DanielFilan

1y

0

36 If I were a well-intentioned AI... I: Image classifier

Stuart_Armstrong

2y

4

32 [AN #62] Are adversarial examples caused by real but imperceptible features?

Rohin Shah

3y

10

28 AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

DanielFilan

1y

3

26 AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilan

1y

1

26 AXRP Episode 11 - Attainable Utility and Power with Alex Turner

DanielFilan

1y

5

18 AXRP Episode 2 - Learning Human Biases with Rohin Shah

DanielFilan

1y

0

18 The Goodhart Game

John_Maxwell

3y

5

16 AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan

1y

5

170 Redwood Research’s current project

Buck

1y

29

134 Takeaways from our robust injury classifier project [Redwood Research]

dmz

3mo

9

129 Why I'm excited about Redwood Research's current project

paulfchristiano

1y

6

118 Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

maxnadeau

1mo

14

106 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC

17d

9

98 High-stakes alignment via adversarial training [Redwood Research report]

dmz

7mo

29

73 Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

KevinRoWang

1mo

5

54 Redwood's Technique-Focused Epistemic Strategy

adamShimi

1y

1

39 Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

Buck

6mo

0

30 AXRP Episode 15 - Natural Abstractions with John Wentworth

DanielFilan

7mo

1

27 Causal scrubbing: results on a paren balance checker

LawrenceC

17d

0

26 AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

DanielFilan

8mo

9

26 AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo

DanielFilan

8mo

1

18 AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan

4mo

0