37 posts
Corrigibility
11 posts
Treacherous Turn
Tripwire
114 | Interpretability/Tool-ness/Alignment/Corrigibility are not Composable | johnswentworth | 4mo | 8
97 | A broad basin of attraction around human values? | Wei_Dai | 8mo | 16
91 | Let's See You Write That Corrigibility Tag | Eliezer Yudkowsky | 6mo | 67
55 | Corrigibility Can Be VNM-Incoherent | TurnTrout | 1y | 24
52 | Corrigibility | paulfchristiano | 4y | 7
48 | Boeing 737 MAX MCAS as an agent corrigibility failure | shminux | 3y | 3
41 | Solve Corrigibility Week | Logan Riggs | 1y | 21
31 | Do what we mean vs. do what we say | Rohin Shah | 4y | 14
31 | Introducing Corrigibility (an FAI research subfield) | So8res | 8y | 28
29 | Cake, or death! | Stuart_Armstrong | 10y | 13
29 | Can corrigibility be learned safely? | Wei_Dai | 4y | 115
26 | Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" | joraine | 26d | 11
26 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
25 | [Intro to brain-like-AGI safety] 14. Controlled AGI | Steven Byrnes | 7mo | 25
102 | Soares, Tallinn, and Yudkowsky discuss AGI cognition | So8res | 1y | 35
80 | A Gym Gridworld Environment for the Treacherous Turn | Michaël Trazzi | 4y | 9
33 | A toy model of the treacherous turn | Stuart_Armstrong | 6y | 13
23 | [Linkpost] Treacherous turns in the wild | Mark Xu | 1y | 6
20 | [AN #165]: When large models are more likely to lie | Rohin Shah | 1y | 0
16 | Any work on honeypots (to detect treacherous turn attempts)? | David Scott Krueger (formerly: capybaralet) | 2y | 4
12 | Superintelligence 11: The treacherous turn | KatjaGrace | 8y | 50
9 | Superintelligence 13: Capability control methods | KatjaGrace | 8y | 48
2 | Corrigibility thoughts II: the robot operator | Stuart_Armstrong | 5y | 2
2 | Corrigibility thoughts III: manipulating versus deceiving | Stuart_Armstrong | 5y | 0
1 | Corrigibility thoughts I: caring about multiple things | Stuart_Armstrong | 5y | 0