Tree of Tags

57 posts: Instrumental Convergence, Deconfusion, Orthogonality Thesis, Gradient Hacking, Gradient Descent

66 posts: Corrigibility, Treacherous Turn, Mild Optimization, Quantilization, Satisficer, Tripwire

4 (Extremely) Naive Gradient Hacking Doesn't Work

ojorgensen

9h

0

58 You can still fetch the coffee today if you're dead tomorrow

davidad

11d

15

72 Instrumental convergence is what makes general intelligence possible

tailcalled

1mo

11

5 Assessing the Capabilities of ChatGPT through Success Rates

Zachary Robertson

7d

0

36 A caveat to the Orthogonality Thesis

Wuschel Schulz

1mo

10

36 Empowerment is (almost) All We Need

jacob_cannell

1mo

43

27 Instrumental convergence in single-agent systems

Edouard Harris

2mo

4

17 Misalignment-by-default in multi-agent systems

Edouard Harris

2mo

8

12 Is the Orthogonality Thesis true for humans?

Noosphere89

1mo

18

15 Instrumental convergence: scale and physical interactions

Edouard Harris

2mo

0

5 [ASoT] Instrumental convergence is useful

Ulisse Mini

1mo

9

1 The Opportunity and Risks of Learning Human Values In-Context

Zachary Robertson

10d

4

205 Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben Pace

3y

60

22 Gradient hacking: definitions and examples

Richard_Ngo

5mo

1

25 Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real"

joraine

26d

11

32 People care about each other even though they have imperfect motivational pointers?

TurnTrout

1mo

25

111 Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth

4mo

8

13 Corrigibility Via Thought-Process Deference

Thane Ruthenis

26d

5

109 Let's See You Write That Corrigibility Tag

Eliezer Yudkowsky

6mo

67

105 A broad basin of attraction around human values?

Wei_Dai

8mo

16

61 Steam

abramdemski

6mo

9

21 CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander]

berglund

2mo

1

118 Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res

1y

35

24 Quantilizers and Generative Models

Adam Jermyn

5mo

5

69 Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

TurnTrout

1y

8

64 Corrigibility Can Be VNM-Incoherent

TurnTrout

1y

24

21 Exploring Mild Behaviour in Embedded Agents

Megan Kinniment

5mo

3

6 Simple question about corrigibility and values in AI.

jmh

1mo

1