246 posts
Interpretability (ML & AI)
Instrumental Convergence
Corrigibility
Myopia
Deconfusion
Self Fulfilling/Refuting Prophecies
Orthogonality Thesis
AI Success Models
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
112 posts
Iterated Amplification
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Experiments
Ought
AI-assisted Alignment
Memory and Mnemonics
Air Conditioning
Verification
230
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
174
Self-fulfilling correlations
PhilGoetz
12y
50
172
The Plan - 2022 Update
johnswentworth
19d
33
171
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More
Ben Pace
3y
60
158
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"
Rob Bensinger
1y
13
149
A transparency and interpretability tech tree
evhub
6mo
10
138
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers
lifelonglearner
1y
16
134
Soares, Tallinn, and Yudkowsky discuss AGI cognition
So8res
1y
35
132
Sorting Pebbles Into Correct Heaps
Eliezer Yudkowsky
14y
109
130
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc
johnswentworth
6mo
52
127
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
125
Goal retention discussion with Eliezer
MaxTegmark
8y
26
113
A broad basin of attraction around human values?
Wei_Dai
8mo
16
112
Seeking Power is Often Convergently Instrumental in MDPs
TurnTrout
3y
38
132
Debate update: Obfuscated arguments problem
Beth Barnes
1y
21
118
Godzilla Strategies
johnswentworth
6mo
65
117
My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda
Chi Nguyen
2y
21
116
Supervise Process, not Outcomes
stuhlmueller
8mo
8
111
Paul's research agenda FAQ
zhukeepa
4y
73
107
Solving Math Problems by Relay
bgold
2y
26
106
Preregistration: Air Conditioner Test
johnswentworth
8mo
64
102
Imitative Generalisation (AKA 'Learning the Prior')
Beth Barnes
1y
14
91
Writeup: Progress on AI Safety via Debate
Beth Barnes
2y
18
85
Model splintering: moving from one imperfect model to another
Stuart_Armstrong
2y
10
84
Rant on Problem Factorization for Alignment
johnswentworth
4mo
48
80
Air Conditioner Test Results & Discussion
johnswentworth
6mo
38
79
Beliefs and Disagreements about Automating Alignment Research
Ian McKenzie
3mo
4
77
Why I'm excited about Debate
Richard_Ngo
1y
12