37 posts
Corrigibility
11 posts
Treacherous Turn
Tripwire
25
Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real"
joraine
26d
11
32
People care about each other even though they have imperfect motivational pointers?
TurnTrout
1mo
25
111
Interpretability/Tool-ness/Alignment/Corrigibility are not Composable
johnswentworth
4mo
8
13
Corrigibility Via Thought-Process Deference
Thane Ruthenis
26d
5
109
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
105
A broad basin of attraction around human values?
Wei_Dai
8mo
16
21
CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander]
berglund
2mo
1
64
Corrigibility Can Be VNM-Incoherent
TurnTrout
1y
24
6
Simple question about corrigibility and values in AI.
jmh
1mo
1
26
[Intro to brain-like-AGI safety] 14. Controlled AGI
Steven Byrnes
7mo
25
39
Solve Corrigibility Week
Logan Riggs
1y
21
16
On corrigibility and its basin
Donald Hobson
6mo
3
15
Infernal Corrigibility, Fiendishly Difficult
David Udell
6mo
1
23
Formalizing Policy-Modification Corrigibility
TurnTrout
1y
6
118
Soares, Tallinn, and Yudkowsky discuss AGI cognition
So8res
1y
35
23
[AN #165]: When large models are more likely to lie
Rohin Shah
1y
0
31
[Linkpost] Treacherous turns in the wild
Mark Xu
1y
6
73
A Gym Gridworld Environment for the Treacherous Turn
Michaël Trazzi
4y
9
17
Any work on honeypots (to detect treacherous turn attempts)?
David Scott Krueger (formerly: capybaralet)
2y
4
36
A toy model of the treacherous turn
Stuart_Armstrong
6y
13
16
Superintelligence 11: The treacherous turn
KatjaGrace
8y
50
14
Superintelligence 13: Capability control methods
KatjaGrace
8y
48
3
Corrigibility thoughts III: manipulating versus deceiving
Stuart_Armstrong
5y
0
3
Corrigibility thoughts II: the robot operator
Stuart_Armstrong
5y
2
2
Corrigibility thoughts I: caring about multiple things
Stuart_Armstrong
5y
0