Tags (246 posts):
Interpretability (ML & AI) · Instrumental Convergence · Corrigibility · Myopia · Deconfusion · Self Fulfilling/Refuting Prophecies · Orthogonality Thesis · AI Success Models · Treacherous Turn · Gradient Hacking · Mild Optimization · Quantilization
Tags (112 posts):
Iterated Amplification · Debate (AI safety technique) · Factored Cognition · Humans Consulting HCH · Experiments · Ought · AI-assisted Alignment · Memory and Mnemonics · Air Conditioning · Verification
Karma | Title | Author | Posted | Comments
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
6 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0
56 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
35 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
21 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
59 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
26 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
56 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
446 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 4mo | 39
78 | By Default, GPTs Think In Plain Sight | Fabien Roger | 1mo | 16
44 | Notes on OpenAI’s alignment plan | Alex Flint | 12d | 5
26 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
11 | Alignment with argument-networks and assessment-predictions | Tor Økland Barstad | 7d | 3
21 | Research request (alignment strategy): Deep dive on "making AI solve alignment for us" | JanBrauner | 19d | 3
184 | Godzilla Strategies | johnswentworth | 6mo | 65
105 | Beliefs and Disagreements about Automating Alignment Research | Ian McKenzie | 3mo | 4
66 | A Library and Tutorial for Factored Cognition with Language Models | stuhlmueller | 2mo | 0
52 | Ought will host a factored cognition “Lab Meeting” | jungofthewon | 3mo | 1
62 | Rant on Problem Factorization for Alignment | johnswentworth | 4mo | 48
17 | Provably Honest - A First Step | Srijanak De | 1mo | 2
80 | Air Conditioner Test Results & Discussion | johnswentworth | 6mo | 38
112 | Preregistration: Air Conditioner Test | johnswentworth | 8mo | 64
120 | Supervise Process, not Outcomes | stuhlmueller | 8mo | 8
52 | A Small Negative Result on Debate | Sam Bowman | 8mo | 11