Tags: Audio, Redwood Research, AXRP, Interviews, Adversarial Examples, Adversarial Training, AI Robustness, Transcripts, Autonomous Weapons
Karma | Title | Author | Posted | Comments
130 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
134 | Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley | maxnadeau | 1mo | 14
24 | Latent Adversarial Training | Adam Jermyn | 5mo | 9
135 | Takeaways from our robust injury classifier project [Redwood Research] | dmz | 3mo | 9
86 | Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small | KevinRoWang | 1mo | 5
136 | High-stakes alignment via adversarial training [Redwood Research report] | dmz | 7mo | 29
48 | Redwood's Technique-Focused Epistemic Strategy | adamShimi | 1y | 1
16 | AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler | DanielFilan | 4mo | 0
12 | AXRP Episode 1 - Adversarial Policies with Adam Gleave | DanielFilan | 1y | 5
13 | AXRP Episode 2 - Learning Human Biases with Rohin Shah | DanielFilan | 1y | 0
34 | AXRP Episode 7 - Side Effects with Victoria Krakovna | DanielFilan | 1y | 6
10 | AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong | DanielFilan | 3mo | 1
33 | Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing | Buck | 6mo | 0
41 | AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger | DanielFilan | 1y | 10
23 | [AN #146]: Plausible stories of how we might fail to avert an existential catastrophe | Rohin Shah | 1y | 1
24 | AXRP Episode 7.5 - Forecasting Transformative AI from Biological Anchors with Ajeya Cotra | DanielFilan | 1y | 1
40 | Rohin Shah on reasons for AI optimism | abergal | 3y | 58
43 | A conversation about Katja's counterarguments to AI risk | Matthew Barnett | 2mo | 9
35 | Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI | Palus Astra | 2y | 4
44 | Conversation with Paul Christiano | abergal | 3y | 6
56 | AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant | DanielFilan | 1y | 2
11 | AI Alignment Podcast: On Lethal Autonomous Weapons with Paul Scharre | Palus Astra | 2y | 0
25 | Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming | Michaël Trazzi | 3mo | 0