123 posts
Interpretability (ML & AI)
Myopia
Self Fulfilling/Refuting Prophecies
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Luck
Market making (AI safety technique)
Verification
123 posts
Instrumental Convergence
Corrigibility
Deconfusion
Orthogonality Thesis
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
Satisficer
Gradient Descent
Tripwire
230
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
174
Self-fulfilling correlations
PhilGoetz
12y
50
172
The Plan - 2022 Update
johnswentworth
19d
33
158
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"
Rob Bensinger
1y
13
149
A transparency and interpretability tech tree
evhub
6mo
10
138
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers
lifelonglearner
1y
16
130
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc
johnswentworth
6mo
52
111
Conversation with Eliezer: What do you want the system to do?
Akash
5mo
38
108
The case for becoming a black-box investigator of language models
Buck
7mo
19
106
Understanding “Deep Double Descent”
evhub
3y
51
106
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
103
The Credit Assignment Problem
abramdemski
3y
40
97
Search versus design
Alex Flint
2y
41
97
Gradations of Inner Alignment Obstacles
abramdemski
1y
22
171
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More
Ben Pace
3y
60
134
Soares, Tallinn, and Yudkowsky discuss AGI cognition
So8res
1y
35
132
Sorting Pebbles Into Correct Heaps
Eliezer Yudkowsky
14y
109
127
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
125
Goal retention discussion with Eliezer
MaxTegmark
8y
26
113
A broad basin of attraction around human values?
Wei_Dai
8mo
16
112
Seeking Power is Often Convergently Instrumental in MDPs
TurnTrout
3y
38
108
Interpretability/Tool-ness/Alignment/Corrigibility are not Composable
johnswentworth
4mo
8
98
Coherence arguments imply a force for goal-directed behavior
KatjaGrace
1y
27
97
Gradient hacking
evhub
3y
39
77
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability
TurnTrout
1y
8
73
Corrigibility Can Be VNM-Incoherent
TurnTrout
1y
24
73
Introducing Corrigibility (an FAI research subfield)
So8res
8y
28
72
Boeing 737 MAX MCAS as an agent corrigibility failure
shminux
3y
3