30 posts
Outer Alignment
Mesa-Optimization
46 posts
Inner Alignment
60
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
LawrenceC
4d
10
87
Trying to Make a Treacherous Mesa-Optimizer
MadHatter
1mo
13
13
The Disastrously Confident And Inaccurate AI
Sharat Jacob Jacob
1mo
0
30
The Speed + Simplicity Prior is probably anti-deceptive
7mo
29
166
Risks from Learned Optimization: Introduction
evhub
3y
42
23
Three questions about mesa-optimizers
Eric Neyman
8mo
5
24
[ASoT] Some thoughts about deceptive mesaoptimization
leogao
8mo
5
7
Inner alignment: what are we pointing at?
lcmgcd
3mo
2
11
Alignment as Game Design
Shoshannah Tekofsky
5mo
7
61
"Inner Alignment Failures" Which Are Actually Outer Alignment Failures
johnswentworth
2y
38
94
List of resolved confusions about IDA
Wei_Dai
3y
18
2
Don't you think RLHF solves outer alignment?
Raphaël S
1mo
19
12
Mesa-utility functions might not be purely proxy goals
Thomas Kwa
8mo
17
45
Mesa-Optimizers vs “Steered Optimizers”
Steven Byrnes
2y
7
96
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout
18d
18
35
Mesa-Optimizers via Grokking
orthonormal
14d
4
26
Take 8: Queer the inner/outer alignment dichotomy.
Charlie Steiner
11d
2
20
Value Formation: An Overarching Model
Thane Ruthenis
1mo
6
72
How likely is deceptive alignment?
evhub
3mo
21
21
Greed Is the Root of This Evil
Thane Ruthenis
2mo
4
36
Broad Picture of Human Values
Thane Ruthenis
4mo
5
43
Outer vs inner misalignment: three framings
Richard_Ngo
5mo
4
8
Is there a demo of "You can't fetch the coffee if you're dead"?
Ram Rachum
1mo
9
103
Selection Theorems: A Program For Understanding Agents
johnswentworth
1y
23
175
Inner Alignment: Explain like I'm 12 Edition
Rafael Harth
2y
46
34
[Intro to brain-like-AGI safety] 10. The alignment problem
Steven Byrnes
8mo
4
27
Clarifying the confusion around inner alignment
Rauno Arike
7mo
0
12
Are Generative World Models a Mesa-Optimization Risk?
Thane Ruthenis
3mo
2