13 posts
Audio
Adversarial Examples
AXRP
18 posts
Redwood Research
Adversarial Training
AI Robustness
31
AXRP Episode 12 - AI Existential Risk with Paul Christiano
DanielFilan
1y
0
32
AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger
DanielFilan
1y
10
21
AXRP Episode 10 - AI’s Future and Impacts with Katja Grace
DanielFilan
1y
2
22
AXRP Episode 7 - Side Effects with Victoria Krakovna
DanielFilan
1y
6
18
AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell
DanielFilan
1y
1
20
AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes
DanielFilan
1y
3
34
If I were a well-intentioned AI... I: Image classifier
Stuart_Armstrong
2y
4
12
AXRP Episode 11 - Attainable Utility and Power with Alex Turner
DanielFilan
1y
5
15
AXRP Episode 3 - Negotiable Reinforcement Learning with Andrew Critch
DanielFilan
1y
0
22
[AN #62] Are adversarial examples caused by real but imperceptible features?
Rohin Shah
3y
10
8
AXRP Episode 2 - Learning Human Biases with Rohin Shah
DanielFilan
1y
0
8
AXRP Episode 1 - Adversarial Policies with Adam Gleave
DanielFilan
1y
5
8
The Goodhart Game
John_Maxwell
3y
5
154
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
LawrenceC
17d
9
150
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
maxnadeau
1mo
14
99
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small
KevinRoWang
1mo
5
25
Causal scrubbing: results on a paren balance checker
LawrenceC
17d
0
136
Takeaways from our robust injury classifier project [Redwood Research]
dmz
3mo
9
17
Causal scrubbing: Appendix
LawrenceC
17d
0
174
High-stakes alignment via adversarial training [Redwood Research report]
dmz
7mo
29
116
Redwood Research’s current project
Buck
1y
29
95
Why I'm excited about Redwood Research's current project
paulfchristiano
1y
6
30
Latent Adversarial Training
Adam Jermyn
5mo
9
34
AXRP Episode 15 - Natural Abstractions with John Wentworth
DanielFilan
7mo
1
27
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing
Buck
6mo
0
14
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler
DanielFilan
4mo
0
11
AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong
DanielFilan
3mo
1