30 posts
Outer Alignment
Mesa-Optimization
46 posts
Inner Alignment
59
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
LawrenceC
4d
10
93
Trying to Make a Treacherous Mesa-Optimizer
MadHatter
1mo
13
28
The Disastrously Confident And Inaccurate AI
Sharat Jacob Jacob
1mo
0
5
Don't you think RLHF solves outer alignment?
Raphaël S
1mo
19
28
The Speed + Simplicity Prior is probably anti-deceptive
7mo
29
195
Risks from Learned Optimization: Introduction
evhub
3y
42
15
Alignment as Game Design
Shoshannah Tekofsky
5mo
7
8
Inner alignment: what are we pointing at?
lcmgcd
3mo
2
24
Three questions about mesa-optimizers
Eric Neyman
8mo
5
25
[ASoT] Some thoughts about deceptive mesaoptimization
leogao
8mo
5
5
Planning capacity and daemons
lcmgcd
2mo
0
14
Mesa-utility functions might not be purely proxy goals
Thomas Kwa
8mo
17
46
"Inner Alignment Failures" Which Are Actually Outer Alignment Failures
johnswentworth
2y
38
72
List of resolved confusions about IDA
Wei_Dai
3y
18
108
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout
18d
18
24
Take 8: Queer the inner/outer alignment dichotomy.
Charlie Steiner
11d
2
29
Mesa-Optimizers via Grokking
orthonormal
14d
4
84
How likely is deceptive alignment?
evhub
3mo
21
21
Value Formation: An Overarching Model
Thane Ruthenis
1mo
6
20
Is there a demo of "You can't fetch the coffee if you're dead"?
Ram Rachum
1mo
9
20
Greed Is the Root of This Evil
Thane Ruthenis
2mo
4
36
Broad Picture of Human Values
Thane Ruthenis
4mo
5
44
Outer vs inner misalignment: three framings
Richard_Ngo
5mo
4
112
Selection Theorems: A Program For Understanding Agents
johnswentworth
1y
23
185
Inner Alignment: Explain like I'm 12 Edition
Rafael Harth
2y
46
19
Deception as the optimal: mesa-optimizers and inner alignment
Eleni Angelou
4mo
0
14
Are Generative World Models a Mesa-Optimization Risk?
Thane Ruthenis
3mo
2
26
Clarifying the confusion around inner alignment
Rauno Arike
7mo
0