2237 posts
AI, AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

358 posts
Iterated Amplification, Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
Karma | Title | Author | Age | Comments
7 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
29 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
33 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
40 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
199 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
108 | The next decades might be wild | Marius Hobbhahn | 5d | 21
15 | Solution to The Alignment Problem | Algon | 1d | 0
95 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
22 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
54 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
39 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
207 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
73 | Revisiting algorithmic progress | Tamay | 7d | 6
51 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2
10 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
70 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
172 | The Plan - 2022 Update | johnswentworth | 19d | 33
31 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
2 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0
46 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
57 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
50 | Notes on OpenAI’s alignment plan | Alex Flint | 12d | 5
36 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
83 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
58 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
40 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
47 | Re-Examining LayerNorm | Eric Winsor | 19d | 8