37 posts
Corrigibility
11 posts
Treacherous Turn
Tripwire
Score | Title | Author | Age | Comments
127 | Let's See You Write That Corrigibility Tag | Eliezer Yudkowsky | 6mo | 67
113 | A broad basin of attraction around human values? | Wei_Dai | 8mo | 16
108 | Interpretability/Tool-ness/Alignment/Corrigibility are not Composable | johnswentworth | 4mo | 8
73 | Corrigibility Can Be VNM-Incoherent | TurnTrout | 1y | 24
73 | Introducing Corrigibility (an FAI research subfield) | So8res | 8y | 28
72 | Boeing 737 MAX MCAS as an agent corrigibility failure | shminux | 3y | 3
63 | Cake, or death! | Stuart_Armstrong | 10y | 13
52 | Corrigibility | paulfchristiano | 4y | 7
41 | Can corrigibility be learned safely? | Wei_Dai | 4y | 115
38 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
37 | Do what we mean vs. do what we say | Rohin Shah | 4y | 14
37 | Solve Corrigibility Week | Logan Riggs | 1y | 21
36 | The limits of corrigibility | Stuart_Armstrong | 4y | 9
34 | Corrigible but misaligned: a superintelligent messiah | zhukeepa | 4y | 26
134 | Soares, Tallinn, and Yudkowsky discuss AGI cognition | So8res | 1y | 35
66 | A Gym Gridworld Environment for the Treacherous Turn | Michaël Trazzi | 4y | 9
39 | [Linkpost] Treacherous turns in the wild | Mark Xu | 1y | 6
39 | A toy model of the treacherous turn | Stuart_Armstrong | 6y | 13
26 | [AN #165]: When large models are more likely to lie | Rohin Shah | 1y | 0
20 | Superintelligence 11: The treacherous turn | KatjaGrace | 8y | 50
19 | Superintelligence 13: Capability control methods | KatjaGrace | 8y | 48
18 | Any work on honeypots (to detect treacherous turn attempts)? | David Scott Krueger (formerly: capybaralet) | 2y | 4
4 | Corrigibility thoughts II: the robot operator | Stuart_Armstrong | 5y | 2
4 | Corrigibility thoughts III: manipulating versus deceiving | Stuart_Armstrong | 5y | 0
3 | Corrigibility thoughts I: caring about multiple things | Stuart_Armstrong | 5y | 0