1014 posts
AI
AI Timelines
Value Learning
AI Takeoff
Embedded Agency
Community
Eliciting Latent Knowledge (ELK)
Reinforcement Learning
Infra-Bayesianism
Counterfactuals
Logic & Mathematics
Interviews
111 posts
Iterated Amplification
Game Theory
Factored Cognition
Humans Consulting HCH
Research Agendas
Ought
Debate (AI safety technique)
Risks of Astronomical Suffering (S-risks)
Center on Long-Term Risk (CLR)
Mechanism Design
Fairness
Group Rationality
49
Existential AI Safety is NOT separate from near-term applications
scasper
7d
15
79
Towards Hodge-podge Alignment
Cleo Nardo
1d
20
271
Reward is not the optimization target
TurnTrout
4mo
97
39
The "Minimal Latents" Approach to Natural Abstractions
johnswentworth
22h
6
85
Trying to disambiguate different questions about whether RLHF is “good”
Buck
6d
39
182
Using GPT-Eliezer against ChatGPT Jailbreaking
Stuart_Armstrong
14d
77
72
A shot at the diamond-alignment problem
TurnTrout
2mo
53
67
Don't design agents which exploit adversarial inputs
TurnTrout
1mo
61
129
Mechanistic anomaly detection and ELK
paulfchristiano
25d
17
42
Four usages of "loss" in AI
TurnTrout
2mo
18
87
Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility
Akash
28d
20
154
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
LawrenceC
17d
9
251
AI alignment is distinct from its near-term applications
paulfchristiano
7d
5
43
In defense of probably wrong mechanistic models
evhub
14d
10
33
My AGI safety research—2022 review, ’23 plans
Steven Byrnes
6d
6
25
Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.
Charlie Steiner
8d
14
44
«Boundaries», Part 3b: Alignment problems in terms of boundaries
Andrew_Critch
6d
2
42
Notes on OpenAI’s alignment plan
Alex Flint
12d
5
285
On how various plans miss the hard bits of the alignment challenge
So8res
5mo
81
61
Rant on Problem Factorization for Alignment
johnswentworth
4mo
48
58
Relaxed adversarial training for inner alignment
evhub
3y
28
125
«Boundaries», Part 1: a key missing concept from utility theory
Andrew_Critch
4mo
26
80
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Diffractor
2mo
11
174
Unifying Bargaining Notions (1/2)
Diffractor
4mo
38
181
Some conceptual alignment research projects
Richard_Ngo
3mo
14
84
Unifying Bargaining Notions (2/2)
Diffractor
4mo
11
132
Paul's research agenda FAQ
zhukeepa
4y
73
49
A Small Negative Result on Debate
Sam Bowman
8mo
11