Branch 1 (153 posts): Interpretability (ML & AI); GPT; Machine Learning (ML); DeepMind; OpenAI; Truth, Semantics, & Meaning; AI Success Models; Bounties & Prizes (active); AI-assisted Alignment; Lottery Ticket Hypothesis; Computer Science; Honesty

Branch 2 (168 posts): Conjecture (org); Oracle AI; Myopia; Language Models; Refine; Deconfusion; Agency; AI Boxing (Containment); Deceptive Alignment; Scaling Laws; Deception; Acausal Trade
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 59 | Reframing inner alignment | davidad | 9d | 13 |
| 114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 32 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 183 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 223 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 61 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 90 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13 |
| 199 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138 |
| 51 | "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability | David Scott Krueger (formerly: capybaralet) | 1mo | 25 |
| 11 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2 |
| 28 | A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) | Neel Nanda | 1mo | 15 |
| 64 | Clarifying AI X-risk | zac_kenton | 1mo | 23 |
| 18 | An exploration of GPT-2's embedding weights | Adam Scherlis | 7d | 2 |
| 28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 64 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10 |
| 35 | Side-channels: input versus output | davidad | 8d | 9 |
| 43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 96 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |
| 185 | Simulators | janus | 3mo | 103 |
| 41 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16 |
| 145 | Decision theory does not imply that we get to have nice things | So8res | 2mo | 53 |
| 178 | Mysteries of mode collapse | janus | 1mo | 35 |
| 28 | Inverse scaling can become U-shaped | Edouard Harris | 1mo | 15 |
| 143 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9 |
| 108 | What I Learned Running Refine | adamShimi | 26d | 5 |
| 40 | Paper: Large Language Models Can Self-improve [Linkpost] | Evan R. Murphy | 2mo | 14 |
| 64 | How likely is deceptive alignment? | evhub | 3mo | 21 |