Tags (2237 posts): AI, AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

Tags (358 posts): Iterated Amplification, Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
Karma | Title | Author | Posted | Comments
84 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
41 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
5 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
198 | The next decades might be wild | Marius Hobbhahn | 5d | 21
265 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
6 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
5 | Career Scouting: Housing Coordination | koratkar | 5h | 0
13 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
71 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
19 | Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) | Remmelt | 1d | 6
323 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
107 | Okay, I feel it now | g1 | 7d | 14
111 | Revisiting algorithmic progress | Tamay | 7d | 6
89 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
6 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0
56 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
35 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
21 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
59 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
44 | Notes on OpenAI’s alignment plan | Alex Flint | 12d | 5
26 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
26 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
56 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24