30 posts
Outer Alignment
Mesa-Optimization
46 posts
Inner Alignment
166
Risks from Learned Optimization: Introduction
evhub
3y
42
94
List of resolved confusions about IDA
Wei_Dai
3y
18
87
Trying to Make a Treacherous Mesa-Optimizer
MadHatter
1mo
13
62
An Increasingly Manipulative Newsfeed
Michaël Trazzi
3y
16
61
"Inner Alignment Failures" Which Are Actually Outer Alignment Failures
johnswentworth
2y
38
60
Prize for probable problems
paulfchristiano
4y
63
60
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
LawrenceC
4d
10
54
[AN #58] Mesa optimization: what it is, and why we should care
Rohin Shah
3y
9
50
Weak arguments against the universal prior being malign
X4vier
4y
23
45
Mesa-Optimizers vs “Steered Optimizers”
Steven Byrnes
2y
7
43
The Steering Problem
paulfchristiano
4y
12
30
The Speed + Simplicity Prior is probably anti-deceptive
7mo
29
24
Outer alignment and imitative amplification
evhub
2y
11
24
[ASoT] Some thoughts about deceptive mesaoptimization
leogao
8mo
5
175
Inner Alignment: Explain like I'm 12 Edition
Rafael Harth
2y
46
103
Selection Theorems: A Program For Understanding Agents
johnswentworth
1y
23
103
Demons in Imperfect Search
johnswentworth
2y
21
99
The Inner Alignment Problem
evhub
3y
17
96
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout
18d
18
87
Tessellating Hills: a toy model for demons in imperfect search
DaemonicSigil
2y
17
81
Open question: are minimal circuits daemon-free?
paulfchristiano
4y
70
77
2-D Robustness
vlad_m
3y
8
72
How likely is deceptive alignment?
evhub
3mo
21
70
A simple environment for showing mesa misalignment
Matthew Barnett
3y
9
70
Discussion: Objective Robustness and Inner Alignment Terminology
jbkjr
1y
7
63
Empirical Observations of Objective Robustness Failures
jbkjr
1y
5
63
Concrete experiments in inner alignment
evhub
3y
12
54
Mesa-Search vs Mesa-Control
abramdemski
2y
45