95 posts
Interpretability (ML & AI)
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Market making (AI safety technique)
Verification
28 posts
Myopia
Self Fulfilling/Refuting Prophecies
Luck
10
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
106
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
70
Can we efficiently explain model behaviors?
paulfchristiano
4d
0
172
The Plan - 2022 Update
johnswentworth
19d
33
31
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
36
[ASoT] Natural abstractions and AlphaZero
Ulisse Mini
10d
1
83
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
58
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
47
Re-Examining LayerNorm
Eric Winsor
19d
8
64
Engineering Monosemanticity in Toy Models
Adam Jermyn
1mo
6
9
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
230
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
42
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
18
Subsets and quotients in interpretability
Erik Jenner
18d
1
40
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
38
Acceptability Verification: A Research Agenda
David Udell
5mo
0
7
Generative, Episodic Objectives for Safe AI
Michael Glass
2mo
3
47
LCDT, A Myopic Decision Theory
adamShimi
1y
51
103
The Credit Assignment Problem
abramdemski
3y
40
54
Open Problems with Myopia
Mark Xu
1y
16
63
Why GPT wants to mesa-optimize & how we might change this
John_Maxwell
2y
32
19
Encouragement to Instill Confidence?
Zvi
10mo
5
65
Arguments against myopic training
Richard_Ngo
2y
39
35
Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability
Michaël Trazzi
1y
0
54
AI safety via market making
evhub
2y
45
60
Partial Agency
abramdemski
3y
18
48
Bayesian Evolving-to-Extinction
abramdemski
2y
13
42
Towards a mechanistic understanding of corrigibility
evhub
3y
26