95 posts
Interpretability (ML & AI)
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Market making (AI safety technique)
Verification
28 posts
Myopia
Self Fulfilling/Refuting Prophecies
Luck
338
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
211
The Plan - 2022 Update
johnswentworth
19d
33
159
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
139
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers
lifelonglearner
1y
16
136
MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"
Rob Bensinger
1y
13
136
A transparency and interpretability tech tree
evhub
6mo
10
135
Understanding “Deep Double Descent”
evhub
3y
51
123
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
123
Interpreting Neural Networks through the Polytope Lens
Sid Black
2mo
26
118
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc
johnswentworth
6mo
52
118
The case for becoming a black-box investigator of language models
Buck
7mo
19
112
Conversation with Eliezer: What do you want the system to do?
Akash
5mo
38
106
A Longlist of Theories of Impact for Interpretability
Neel Nanda
9mo
29
99
Re-Examining LayerNorm
Eric Winsor
19d
8
144
Self-fulfilling correlations
PhilGoetz
12y
50
91
The Credit Assignment Problem
abramdemski
3y
40
58
Partial Agency
abramdemski
3y
18
57
Open Problems with Myopia
Mark Xu
1y
16
56
Arguments against myopic training
Richard_Ngo
2y
39
55
Why GPT wants to mesa-optimize & how we might change this
John_Maxwell
2y
32
55
AI safety via market making
evhub
2y
45
50
LCDT, A Myopic Decision Theory
adamShimi
1y
51
44
Towards a mechanistic understanding of corrigibility
evhub
3y
26
43
Acceptability Verification: A Research Agenda
David Udell
5mo
0
38
Fifty Shades of Self-Fulfilling Prophecy
PhilGoetz
8y
87
38
Bayesian Evolving-to-Extinction
abramdemski
2y
13
37
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
33
An example of self-fulfilling spurious proofs in UDT
cousin_it
10y
43