Tree of Tags

37 posts Corrigibility

11 posts Treacherous Turn Tripwire

127 Let's See You Write That Corrigibility Tag

Eliezer Yudkowsky

6mo

67

113 A broad basin of attraction around human values?

Wei_Dai

8mo

16

108 Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth

4mo

8

73 Corrigibility Can Be VNM-Incoherent

TurnTrout

1y

24

73 Introducing Corrigibility (an FAI research subfield)

So8res

8y

28

72 Boeing 737 MAX MCAS as an agent corrigibility failure

shminux

3y

3

63 Cake, or death!

Stuart_Armstrong

10y

13

52 Corrigibility

paulfchristiano

4y

7

41 Can corrigibility be learned safely?

Wei_Dai

4y

115

38 People care about each other even though they have imperfect motivational pointers?

TurnTrout

1mo

25

37 Do what we mean vs. do what we say

Rohin Shah

4y

14

37 Solve Corrigibility Week

Logan Riggs

1y

21

36 The limits of corrigibility

Stuart_Armstrong

4y

9

34 Corrigible but misaligned: a superintelligent messiah

zhukeepa

4y

26

134 Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res

1y

35

66 A Gym Gridworld Environment for the Treacherous Turn

Michaël Trazzi

4y

9

39 [Linkpost] Treacherous turns in the wild

Mark Xu

1y

6

39 A toy model of the treacherous turn

Stuart_Armstrong

6y

13

26 [AN #165]: When large models are more likely to lie

Rohin Shah

1y

0

20 Superintelligence 11: The treacherous turn

KatjaGrace

8y

50

19 Superintelligence 13: Capability control methods

KatjaGrace

8y

48

18 Any work on honeypots (to detect treacherous turn attempts)?

David Scott Krueger (formerly: capybaralet)

2y

4

4 Corrigibility thoughts II: the robot operator

Stuart_Armstrong

5y

2

4 Corrigibility thoughts III: manipulating versus deceiving

Stuart_Armstrong

5y

0

3 Corrigibility thoughts I: caring about multiple things

Stuart_Armstrong

5y

0