Tree of Tags

57 posts: Instrumental Convergence, Deconfusion, Orthogonality Thesis, Gradient Hacking, Gradient Descent

66 posts: Corrigibility, Treacherous Turn, Mild Optimization, Quantilization, Satisficer, Tripwire

2 (Extremely) Naive Gradient Hacking Doesn't Work

ojorgensen

9h

0

57 You can still fetch the coffee today if you're dead tomorrow

davidad

11d

15

59 Instrumental convergence is what makes general intelligence possible

tailcalled

1mo

11

33 A caveat to the Orthogonality Thesis

Wuschel Schulz

1mo

10

4 Assessing the Capabilities of ChatGPT through Success Rates

Zachary Robertson

7d

0

38 Empowerment is (almost) All We Need

jacob_cannell

1mo

43

22 Instrumental convergence in single-agent systems

Edouard Harris

2mo

4

2 The Opportunity and Risks of Learning Human Values In-Context

Zachary Robertson

10d

4

17 Misalignment-by-default in multi-agent systems

Edouard Harris

2mo

8

10 Is the Orthogonality Thesis true for humans?

Noosphere89

1mo

18

30 Gradient hacking: definitions and examples

Richard_Ngo

5mo

1

10 Instrumental convergence: scale and physical interactions

Edouard Harris

2mo

0

98 Coherence arguments imply a force for goal-directed behavior

KatjaGrace

1y

27

68 Gradient descent is not just more efficient genetic algorithms

leogao

1y

14

24 Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real"

joraine

26d

11

38 People care about each other even though they have imperfect motivational pointers?

TurnTrout

1mo

25

17 Corrigibility Via Thought-Process Deference

Thane Ruthenis

26d

5

108 Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth

4mo

8

127 Let's See You Write That Corrigibility Tag

Eliezer Yudkowsky

6mo

67

113 A broad basin of attraction around human values?

Wei_Dai

8mo

16

60 Steam

abramdemski

6mo

9

134 Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res

1y

35

17 CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander]

berglund

2mo

1

27 Quantilizers and Generative Models

Adam Jermyn

5mo

5

77 Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

TurnTrout

1y

8

73 Corrigibility Can Be VNM-Incoherent

TurnTrout

1y

24

7 Simple question about corrigibility and values in AI.

jmh

1mo

1

27 [Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes

7mo

25