13 posts
Redwood Research
Adversarial Training
AI Robustness
5 posts
AXRP
154
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
LawrenceC
17d
9
150
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
maxnadeau
1mo
14
30
Latent Adversarial Training
Adam Jermyn
5mo
9
136
Takeaways from our robust injury classifier project [Redwood Research]
dmz
3mo
9
99
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small
KevinRoWang
1mo
5
174
High-stakes alignment via adversarial training [Redwood Research report]
dmz
7mo
29
42
Redwood's Technique-Focused Epistemic Strategy
adamShimi
1y
1
14
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler
DanielFilan
4mo
0
27
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing
Buck
6mo
0
116
Redwood Research’s current project
Buck
1y
29
17
Causal scrubbing: Appendix
LawrenceC
17d
0
25
Causal scrubbing: results on a paren balance checker
LawrenceC
17d
0
95
Why I'm excited about Redwood Research's current project
paulfchristiano
1y
6
11
AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong
DanielFilan
3mo
1
34
AXRP Episode 15 - Natural Abstractions with John Wentworth
DanielFilan
7mo
1
20
AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy
DanielFilan
8mo
9
10
AXRP Episode 16 - Preparing for Debate AI with Geoffrey Irving
DanielFilan
5mo
0
22
AXRP Episode 13 - First Principles of AGI Safety with Richard Ngo
DanielFilan
8mo
1