Tags: Audio, Redwood Research, AXRP, Interviews, Adversarial Examples, Adversarial Training, AI Robustness, Transcripts, Autonomous Weapons
Karma | Title | Author | Posted | Comments
130 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
134 | Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley | maxnadeau | 1mo | 14
24 | Latent Adversarial Training | Adam Jermyn | 5mo | 9
135 | Takeaways from our robust injury classifier project [Redwood Research] | dmz | 3mo | 9
86 | Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small | KevinRoWang | 1mo | 5
136 | High-stakes alignment via adversarial training [Redwood Research report] | dmz | 7mo | 29
48 | Redwood's Technique-Focused Epistemic Strategy | adamShimi | 1y | 1
16 | AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler | DanielFilan | 4mo | 0
12 | AXRP Episode 1 - Adversarial Policies with Adam Gleave | DanielFilan | 1y | 5
13 | AXRP Episode 2 - Learning Human Biases with Rohin Shah | DanielFilan | 1y | 0
34 | AXRP Episode 7 - Side Effects with Victoria Krakovna | DanielFilan | 1y | 6
10 | AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong | DanielFilan | 3mo | 1
33 | Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing | Buck | 6mo | 0
41 | AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger | DanielFilan | 1y | 10
23 | [AN #146]: Plausible stories of how we might fail to avert an existential catastrophe | Rohin Shah | 1y | 1
24 | AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra | DanielFilan | 1y | 1
40 | Rohin Shah on reasons for AI optimism | abergal | 3y | 58
43 | A conversation about Katja's counterarguments to AI risk | Matthew Barnett | 2mo | 9
35 | Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI | Palus Astra | 2y | 4
44 | Conversation with Paul Christiano | abergal | 3y | 6
56 | AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant | DanielFilan | 1y | 2
11 | AI Alignment Podcast: On Lethal Autonomous Weapons with Paul Scharre | Palus Astra | 2y | 0
25 | Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming | Michaël Trazzi | 3mo | 0