95 posts
Interpretability (ML & AI)
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Market making (AI safety technique)
Verification
28 posts
Myopia
Self Fulfilling/Refuting Prophecies
Luck
16
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
140
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
56
Can we efficiently explain model behaviors?
paulfchristiano
4d
0
250
The Plan - 2022 Update
johnswentworth
19d
33
235
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
151
Re-Examining LayerNorm
Eric Winsor
19d
8
35
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
21
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
26
[ASoT] Natural abstractions and AlphaZero
Ulisse Mini
10d
1
56
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
446
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
78
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
80
Engineering Monosemanticity in Toy Models
Adam Jermyn
1mo
6
201
Interpreting Neural Networks through the Polytope Lens
Sid Black
2mo
26
34
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
48
Acceptability Verification: A Research Agenda
David Udell
5mo
0
15
Generative, Episodic Objectives for Safe AI
Michael Glass
2mo
3
53
LCDT, A Myopic Decision Theory
adamShimi
1y
51
60
Open Problems with Myopia
Mark Xu
1y
16
79
The Credit Assignment Problem
abramdemski
3y
40
56
AI safety via market making
evhub
2y
45
47
Why GPT wants to mesa-optimize & how we might change this
John_Maxwell
2y
32
47
Arguments against myopic training
Richard_Ngo
2y
39
13
Encouragement to Instill Confidence?
Zvi
10mo
5
56
Partial Agency
abramdemski
3y
18
21
Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability
Michaël Trazzi
1y
0
46
Towards a mechanistic understanding of corrigibility
evhub
3y
26
32
Random Thoughts on Predict-O-Matic
abramdemski
3y
3