37 posts
Corrigibility
11 posts
Treacherous Turn
Tripwire
24
Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real"
joraine
26d
11
38
People care about each other even though they have imperfect motivational pointers?
TurnTrout
1mo
25
17
Corrigibility Via Thought-Process Deference
Thane Ruthenis
26d
5
108
Interpretability/Tool-ness/Alignment/Corrigibility are not Composable
johnswentworth
4mo
8
127
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
113
A broad basin of attraction around human values?
Wei_Dai
8mo
16
17
CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander]
berglund
2mo
1
73
Corrigibility Can Be VNM-Incoherent
TurnTrout
1y
24
7
Simple question about corrigibility and values in AI.
jmh
1mo
1
27
[Intro to brain-like-AGI safety] 14. Controlled AGI
Steven Byrnes
7mo
25
16
On corrigibility and its basin
Donald Hobson
6mo
3
37
Solve Corrigibility Week
Logan Riggs
1y
21
25
Formalizing Policy-Modification Corrigibility
TurnTrout
1y
6
9
Infernal Corrigibility, Fiendishly Difficult
David Udell
6mo
1
134
Soares, Tallinn, and Yudkowsky discuss AGI cognition
So8res
1y
35
39
[Linkpost] Treacherous turns in the wild
Mark Xu
1y
6
26
[AN #165]: When large models are more likely to lie
Rohin Shah
1y
0
66
A Gym Gridworld Environment for the Treacherous Turn
Michaël Trazzi
4y
9
18
Any work on honeypots (to detect treacherous turn attempts)?
David Scott Krueger (formerly: capybaralet)
2y
4
39
A toy model of the treacherous turn
Stuart_Armstrong
6y
13
20
Superintelligence 11: The treacherous turn
KatjaGrace
8y
50
19
Superintelligence 13: Capability control methods
KatjaGrace
8y
48
4
Corrigibility thoughts III: manipulating versus deceiving
Stuart_Armstrong
5y
0
4
Corrigibility thoughts II: the robot operator
Stuart_Armstrong
5y
2
3
Corrigibility thoughts I: caring about multiple things
Stuart_Armstrong
5y
0