2237 posts
Tags: AI, AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

358 posts
Tags: Iterated Amplification, Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
Score | Title | Author | Posted | Comments
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
6 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
153 | The next decades might be wild | Marius Hobbhahn | 5d | 21
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
3 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
11 | Solution to The Alignment Problem | Algon | 1d | 0
92 | Revisiting algorithmic progress | Tamay | 7d | 6
18 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
83 | Okay, I feel it now | g1 | 7d | 14
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
4 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
58 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
99 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
36 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
47 | Notes on OpenAI’s alignment plan | Alex Flint | 12d | 5
22 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
31 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24