246 posts
Interpretability (ML & AI)
Instrumental Convergence
Corrigibility
Myopia
Deconfusion
Self Fulfilling/Refuting Prophecies
Orthogonality Thesis
AI Success Models
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
112 posts
Iterated Amplification
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Experiments
Ought
AI-assisted Alignment
Memory and Mnemonics
Air Conditioning
Verification
Score | Title | Author | Age | Comments
446 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 4mo | 39
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
239 | Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More | Ben Pace | 3y | 60
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
208 | Sorting Pebbles Into Correct Heaps | Eliezer Yudkowsky | 14y | 109
201 | Interpreting Neural Networks through the Polytope Lens | Sid Black | 2mo | 26
194 | Seeking Power is Often Convergently Instrumental in MDPs | TurnTrout | 3y | 38
164 | Understanding “Deep Double Descent” | evhub | 3y | 51
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
140 | Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers | lifelonglearner | 1y | 16
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
134 | Circumventing interpretability: How to defeat mind-readers | Lee Sharkey | 5mo | 8
131 | A Longlist of Theories of Impact for Interpretability | Neel Nanda | 9mo | 29
128 | The case for becoming a black-box investigator of language models | Buck | 7mo | 19
Score | Title | Author | Age | Comments
184 | Godzilla Strategies | johnswentworth | 6mo | 65
139 | Paul's research agenda FAQ | zhukeepa | 4y | 73
121 | My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda | Chi Nguyen | 2y | 21
120 | Supervise Process, not Outcomes | stuhlmueller | 8mo | 8
118 | Debate update: Obfuscated arguments problem | Beth Barnes | 1y | 21
112 | Preregistration: Air Conditioner Test | johnswentworth | 8mo | 64
105 | Beliefs and Disagreements about Automating Alignment Research | Ian McKenzie | 3mo | 4
98 | Ought: why it matters and ways to help | paulfchristiano | 3y | 7
97 | Writeup: Progress on AI Safety via Debate | Beth Barnes | 2y | 18
89 | Solving Math Problems by Relay | bgold | 2y | 26
82 | Imitative Generalisation (AKA 'Learning the Prior') | Beth Barnes | 1y | 14
80 | Air Conditioner Test Results & Discussion | johnswentworth | 6mo | 38
74 | A guide to Iterated Amplification & Debate | Rafael Harth | 2y | 10
69 | Why I'm excited about Debate | Richard_Ngo | 1y | 12