37 posts
Corrigibility
11 posts
Treacherous Turn
Tripwire
111 karma | Interpretability/Tool-ness/Alignment/Corrigibility are not Composable | johnswentworth | 4mo | 8 comments
109 karma | Let's See You Write That Corrigibility Tag | Eliezer Yudkowsky | 6mo | 67 comments
105 karma | A broad basin of attraction around human values? | Wei_Dai | 8mo | 16 comments
64 karma | Corrigibility Can Be VNM-Incoherent | TurnTrout | 1y | 24 comments
60 karma | Boeing 737 MAX MCAS as an agent corrigibility failure | shminux | 3y | 3 comments
52 karma | Corrigibility | paulfchristiano | 4y | 7 comments
52 karma | Introducing Corrigibility (an FAI research subfield) | So8res | 8y | 28 comments
46 karma | Cake, or death! | Stuart_Armstrong | 10y | 13 comments
39 karma | Solve Corrigibility Week | Logan Riggs | 1y | 21 comments
35 karma | Can corrigibility be learned safely? | Wei_Dai | 4y | 115 comments
34 karma | Do what we mean vs. do what we say | Rohin Shah | 4y | 14 comments
32 karma | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25 comments
28 karma | Corrigible but misaligned: a superintelligent messiah | zhukeepa | 4y | 26 comments
27 karma | The limits of corrigibility | Stuart_Armstrong | 4y | 9 comments
118 karma | Soares, Tallinn, and Yudkowsky discuss AGI cognition | So8res | 1y | 35 comments
73 karma | A Gym Gridworld Environment for the Treacherous Turn | Michaël Trazzi | 4y | 9 comments
36 karma | A toy model of the treacherous turn | Stuart_Armstrong | 6y | 13 comments
31 karma | [Linkpost] Treacherous turns in the wild | Mark Xu | 1y | 6 comments
23 karma | [AN #165]: When large models are more likely to lie | Rohin Shah | 1y | 0 comments
17 karma | Any work on honeypots (to detect treacherous turn attempts)? | David Scott Krueger (formerly: capybaralet) | 2y | 4 comments
16 karma | Superintelligence 11: The treacherous turn | KatjaGrace | 8y | 50 comments
14 karma | Superintelligence 13: Capability control methods | KatjaGrace | 8y | 48 comments
3 karma | Corrigibility thoughts II: the robot operator | Stuart_Armstrong | 5y | 2 comments
3 karma | Corrigibility thoughts III: manipulating versus deceiving | Stuart_Armstrong | 5y | 0 comments
2 karma | Corrigibility thoughts I: caring about multiple things | Stuart_Armstrong | 5y | 0 comments