153 posts
Interpretability (ML & AI)
GPT
Machine Learning (ML)
DeepMind
OpenAI
Truth, Semantics, & Meaning
AI Success Models
Bounties & Prizes (active)
AI-assisted Alignment
Lottery Ticket Hypothesis
Computer Science
Honesty
168 posts
Conjecture (org)
Oracle AI
Myopia
Language Models
Refine
Deconfusion
Agency
AI Boxing (Containment)
Deceptive Alignment
Scaling Laws
Deception
Acausal Trade
15
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
132
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
307
A challenge for AGI organizations, and a challenge for readers
Rob Bensinger
19d
30
70
Predicting GPU performance
Marius Hobbhahn
6d
24
239
The Plan - 2022 Update
johnswentworth
19d
33
142
Re-Examining LayerNorm
Eric Winsor
19d
8
96
[Link] Why I’m optimistic about OpenAI’s alignment approach
janleike
15d
13
33
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
20
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
34
An exploration of GPT-2's embedding weights
Adam Scherlis
7d
2
35
Reframing inner alignment
davidad
9d
13
62
[ASoT] Finetuning, RL, and GPT's world prior
Jozdien
18d
8
25
[ASoT] Natural abstractions and AlphaZero
Ulisse Mini
10d
1
53
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
26
Discovering Language Model Behaviors with Model-Written Evaluations
evhub
4h
3
67
Proper scoring rules don’t guarantee predicting fixed points
Johannes_Treutlein
4d
2
31
Take 11: "Aligning language models" should be weirder.
Charlie Steiner
2d
0
96
[Interim research report] Taking features out of superposition with sparse autoencoders
Lee Sharkey
7d
10
222
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
223
Conjecture: a retrospective after 8 months of work
Connor Leahy
27d
9
759
Simulators
janus
3mo
103
248
Mysteries of mode collapse
janus
1mo
35
35
Side-channels: input versus output
davidad
8d
9
97
Searching for Search
NicholasKees
22d
6
118
Conjecture Second Hiring Round
Connor Leahy
27d
0
98
What I Learned Running Refine
adamShimi
26d
5
123
Current themes in mechanistic interpretability research
Lee Sharkey
1mo
3
494
chinchilla's wild implications
nostalgebraist
4mo
114