30 posts
Outer Alignment
Mesa-Optimization
46 posts
Inner Alignment
137
Risks from Learned Optimization: Introduction
evhub
3y
42
116
List of resolved confusions about IDA
Wei_Dai
3y
18
81
Trying to Make a Treacherous Mesa-Optimizer
MadHatter
1mo
13
76
"Inner Alignment Failures" Which Are Actually Outer Alignment Failures
johnswentworth
2y
38
71
An Increasingly Manipulative Newsfeed
Michaël Trazzi
3y
16
71
Prize for probable problems
paulfchristiano
4y
63
61
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
LawrenceC
4d
10
58
Weak arguments against the universal prior being malign
X4vier
4y
23
47
[AN #58] Mesa optimization: what it is, and why we should care
Rohin Shah
3y
9
44
The Steering Problem
paulfchristiano
4y
12
38
Mesa-Optimizers vs “Steered Optimizers”
Steven Byrnes
2y
7
32
The Speed + Simplicity Prior is probably anti-deceptive
7mo
29
31
Outer alignment and imitative amplification
evhub
2y
11
23
[ASoT] Some thoughts about deceptive mesaoptimization
leogao
8mo
5
165
Inner Alignment: Explain like I'm 12 Edition
Rafael Harth
2y
46
108
Demons in Imperfect Search
johnswentworth
2y
21
94
Selection Theorems: A Program For Understanding Agents
johnswentworth
1y
23
93
The Inner Alignment Problem
evhub
3y
17
92
Tessellating Hills: a toy model for demons in imperfect search
DaemonicSigil
2y
17
84
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout
18d
18
84
Open question: are minimal circuits daemon-free?
paulfchristiano
4y
70
77
Discussion: Objective Robustness and Inner Alignment Terminology
jbkjr
1y
7
74
2-D Robustness
vlad_m
3y
8
73
Mesa-Search vs Mesa-Control
abramdemski
2y
45
70
Concrete experiments in inner alignment
evhub
3y
12
68
Empirical Observations of Objective Robustness Failures
jbkjr
1y
5
65
A simple environment for showing mesa misalignment
Matthew Barnett
3y
9
60
How likely is deceptive alignment?
evhub
3mo
21