Tree of Tags

13 posts: Redwood Research, Adversarial Training, AI Robustness

5 posts: AXRP

174 High-stakes alignment via adversarial training [Redwood Research report]

dmz

7mo

29

154 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC

17d

9

150 Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

maxnadeau

1mo

14

136 Takeaways from our robust injury classifier project [Redwood Research]

dmz

3mo

9

116 Redwood Research’s current project

Buck

1y

29

99 Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

KevinRoWang

1mo

5

95 Why I'm excited about Redwood Research's current project

paulfchristiano

1y

6

42 Redwood's Technique-Focused Epistemic Strategy

adamShimi

1y

1

30 Latent Adversarial Training

Adam Jermyn

5mo

9

27 Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

Buck

6mo

0

25 Causal scrubbing: results on a paren balance checker

LawrenceC

17d

0

17 Causal scrubbing: Appendix

LawrenceC

17d

0

14 AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan

4mo

0

34 AXRP Episode 15 - Natural Abstractions with John Wentworth

DanielFilan

7mo

1

22 AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo

DanielFilan

8mo

1

20 AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

DanielFilan

8mo

9

11 AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong

DanielFilan

3mo

1

10 AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving

DanielFilan

5mo

0