Tags (620 posts): AI, Eliciting Latent Knowledge (ELK), AI Robustness, Truthful AI, Autonomy and Choice, Intelligence Explosion, Social Media, Transcripts

Tags (114 posts): Infra-Bayesianism, Counterfactuals, Interviews, Audio, Logic & Mathematics, AXRP, Redwood Research, Domain Theory, Counterfactual Mugging, Newcomb's Problem, Formal Proof, Functional Decision Theory
Posts (karma · title · author · age · comments):

- 79 · Towards Hodge-podge Alignment · Cleo Nardo · 1d · 20
- 39 · The "Minimal Latents" Approach to Natural Abstractions · johnswentworth · 22h · 6
- 251 · AI alignment is distinct from its near-term applications · paulfchristiano · 7d · 5
- 12 · Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. · Charlie Steiner · 19h · 0
- 53 · Can we efficiently explain model behaviors? · paulfchristiano · 4d · 0
- 85 · Trying to disambiguate different questions about whether RLHF is "good" · Buck · 6d · 39
- 182 · Using GPT-Eliezer against ChatGPT Jailbreaking · Stuart_Armstrong · 14d · 77
- 49 · Existential AI Safety is NOT separate from near-term applications · scasper · 7d · 15
- 29 · High-level hopes for AI alignment · HoldenKarnofsky · 5d · 3
- 129 · Mechanistic anomaly detection and ELK · paulfchristiano · 25d · 17
- 76 · Finding gliders in the game of life · paulfchristiano · 19d · 7
- 23 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. · Charlie Steiner · 7d · 3
- 503 · (My understanding of) What Everyone in Technical Alignment is Doing and Why · Thomas Larsen · 3mo · 83
- 48 · Verification Is Not Easier Than Generation In General · johnswentworth · 14d · 23
- 154 · Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · LawrenceC · 17d · 9
- 150 · Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley · maxnadeau · 1mo · 14
- 99 · Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small · KevinRoWang · 1mo · 5
- 25 · Causal scrubbing: results on a paren balance checker · LawrenceC · 17d · 0
- 136 · Takeaways from our robust injury classifier project [Redwood Research] · dmz · 3mo · 9
- 17 · Causal scrubbing: Appendix · LawrenceC · 17d · 0
- 29 · Vanessa Kosoy's PreDCA, distilled · Martín Soto · 1mo · 17
- 174 · High-stakes alignment via adversarial training [Redwood Research report] · dmz · 7mo · 29
- 35 · A conversation about Katja's counterarguments to AI risk · Matthew Barnett · 2mo · 9
- 57 · Infra-Exercises, Part 1 · Diffractor · 3mo · 9
- 30 · Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming · Michaël Trazzi · 3mo · 0
- 98 · Infra-Bayesian physicalism: a formal theory of naturalized induction · Vanessa Kosoy · 1y · 20
- 116 · Redwood Research's current project · Buck · 1y · 29
- 95 · Why I'm excited about Redwood Research's current project · paulfchristiano · 1y · 6