123 posts
Interpretability (ML & AI)
Myopia
Self Fulfilling/Refuting Prophecies
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Luck
Market making (AI safety technique)
Verification
123 posts
Instrumental Convergence
Corrigibility
Deconfusion
Orthogonality Thesis
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
Satisficer
Gradient Descent
Tripwire
13
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
123
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
63
Can we efficiently explain model behaviors?
paulfchristiano
4d
0
211
The Plan - 2022 Update
johnswentworth
19d
33
26
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
159
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
99
Re-Examining LayerNorm
Eric Winsor
19d
8
22
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
31
[ASoT] Natural abstractions and AlphaZero
Ulisse Mini
10d
1
57
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
37
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
72
Engineering Monosemanticity in Toy Models
Adam Jermyn
1mo
6
338
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
60
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
4
(Extremely) Naive Gradient Hacking Doesn't Work
ojorgensen
9h
0
58
You can still fetch the coffee today if you're dead tomorrow
davidad
11d
15
72
Instrumental convergence is what makes general intelligence possible
tailcalled
1mo
11
25
Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real"
joraine
26d
11
5
Assessing the Capabilities of ChatGPT through Success Rates
Zachary Robertson
7d
0
36
A caveat to the Orthogonality Thesis
Wuschel Schulz
1mo
10
32
People care about each other even though they have imperfect motivational pointers?
TurnTrout
1mo
25
111
Interpretability/Tool-ness/Alignment/Corrigibility are not Composable
johnswentworth
4mo
8
36
Empowerment is (almost) All We Need
jacob_cannell
1mo
43
13
Corrigibility Via Thought-Process Deference
Thane Ruthenis
26d
5
109
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
27
Instrumental convergence in single-agent systems
Edouard Harris
2mo
4
105
A broad basin of attraction around human values?
Wei_Dai
8mo
16
61
Steam
abramdemski
6mo
9