95 posts
Interpretability (ML & AI)
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Market making (AI safety technique)
Verification
28 posts
Myopia
Self Fulfilling/Refuting Prophecies
Luck
13
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
123
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
63
Can we efficiently explain model behaviors?
paulfchristiano
4d
0
211
The Plan - 2022 Update
johnswentworth
19d
33
26
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
159
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
99
Re-Examining LayerNorm
Eric Winsor
19d
8
22
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
31
[ASoT] Natural abstractions and AlphaZero
Ulisse Mini
10d
1
57
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
72
Engineering Monosemanticity in Toy Models
Adam Jermyn
1mo
6
338
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
60
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
24
Subsets and quotients in interpretability
Erik Jenner
18d
1
37
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
43
Acceptability Verification: A Research Agenda
David Udell
5mo
0
11
Generative, Episodic Objectives for Safe AI
Michael Glass
2mo
3
50
LCDT, A Myopic Decision Theory
adamShimi
1y
51
57
Open Problems with Myopia
Mark Xu
1y
16
91
The Credit Assignment Problem
abramdemski
3y
40
55
Why GPT wants to mesa-optimize & how we might change this
John_Maxwell
2y
32
56
Arguments against myopic training
Richard_Ngo
2y
39
16
Encouragement to Instill Confidence?
Zvi
10mo
5
55
AI safety via market making
evhub
2y
45
28
Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability
Michaël Trazzi
1y
0
58
Partial Agency
abramdemski
3y
18
38
Bayesian Evolving-to-Extinction
abramdemski
2y
13
44
Towards a mechanistic understanding of corrigibility
evhub
3y
26