20 posts
Corrigibility
Treacherous Turn
Programming
2017-2019 AI Alignment Prize
Petrov Day
9 posts
Instrumental Convergence
Satisficer
LessWrong Event Transcripts
13
Corrigibility Via Thought-Process Deference
Thane Ruthenis
26d
5
109
Let's See You Write That Corrigibility Tag
Eliezer Yudkowsky
6mo
67
39
Solve Corrigibility Week
Logan Riggs
1y
21
16
On corrigibility and its basin
Donald Hobson
6mo
3
23
Formalizing Policy-Modification Corrigibility
TurnTrout
1y
6
23
[AN #165]: When large models are more likely to lie
Rohin Shah
1y
0
31
[Linkpost] Treacherous turns in the wild
Mark Xu
1y
6
93
Announcement: AI alignment prize round 3 winners and next round
cousin_it
4y
7
74
Announcement: AI alignment prize round 4 winners
cousin_it
3y
41
73
A Gym Gridworld Environment for the Treacherous Turn
Michaël Trazzi
4y
9
52
Corrigibility
paulfchristiano
4y
7
34
Do what we mean vs. do what we say
Rohin Shah
4y
14
35
Can corrigibility be learned safely?
Wei_Dai
4y
115
23
Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.
RyanCarey
4y
1
58
You can still fetch the coffee today if you're dead tomorrow
davidad
11d
15
36
Empowerment is (almost) All We Need
jacob_cannell
1mo
43
69
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability
TurnTrout
1y
8
205
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More
Ben Pace
3y
60
71
Environmental Structure Can Cause Instrumental Convergence
TurnTrout
1y
44
153
Seeking Power is Often Convergently Instrumental in MDPs
TurnTrout
3y
38
23
MDP models are determined by the agent architecture and the environmental dynamics
TurnTrout
1y
34
42
Clarifying Power-Seeking and Instrumental Convergence
TurnTrout
3y
7
30
Power as Easily Exploitable Opportunities
TurnTrout
2y
5