123 posts
Interpretability (ML & AI)
Myopia
Self Fulfilling/Refuting Prophecies
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Luck
Market making (AI safety technique)
Verification
123 posts
Instrumental Convergence
Corrigibility
Deconfusion
Orthogonality Thesis
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
Satisficer
Gradient Descent
Tripwire
446
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
250
The Plan - 2022 Update
johnswentworth
19d
33
235
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
201
Interpreting Neural Networks through the Polytope Lens
Sid Black
2mo
26
164
Understanding “Deep Double Descent”
evhub
3y
51
151
Re-Examining LayerNorm
Eric Winsor
19d
8
140
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers
lifelonglearner
1y
16
140
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
134
Circumventing interpretability: How to defeat mind-readers
Lee Sharkey
5mo
8
131
A Longlist of Theories of Impact for Interpretability
Neel Nanda
9mo
29
128
The case for becoming a black-box investigator of language models
Buck
7mo
19
123
A transparency and interpretability tech tree
evhub
6mo
10
114
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"
Rob Bensinger
1y
13
114
Self-fulfilling correlations
PhilGoetz
12y
50
239
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More
Ben Pace
3y
60
208
Sorting Pebbles Into Correct Heaps
Eliezer Yudkowsky
14y
109
194
Seeking Power is Often Convergently Instrumental in MDPs
TurnTrout
3y
38
114
Interpretability/Tool-ness/Alignment/Corrigibility are not Composable
johnswentworth
4mo
8
102
Soares, Tallinn, and Yudkowsky discuss AGI cognition
So8res
1y
35
101
Gradient hacking
evhub
3y
39
97
A broad basin of attraction around human values?
Wei_Dai
8mo
16
91
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
85
Instrumental convergence is what makes general intelligence possible
tailcalled
1mo
11
80
A Gym Gridworld Environment for the Treacherous Turn
Michaël Trazzi
4y
9
78
Coherence arguments imply a force for goal-directed behavior
KatjaGrace
1y
27
69
Looking Deeper at Deconfusion
adamShimi
1y
13
62
Steam
abramdemski
6mo
9
61
Goal retention discussion with Eliezer
MaxTegmark
8y
26