Tree of Tags

30 posts Outer Alignment Mesa-Optimization

46 posts Inner Alignment

60 Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC

4d

10

87 Trying to Make a Treacherous Mesa-Optimizer

MadHatter

1mo

13

13 The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob

1mo

0

30 The Speed + Simplicity Prior is probably anti-deceptive

7mo

29

166 Risks from Learned Optimization: Introduction

evhub

3y

42

23 Three questions about mesa-optimizers

Eric Neyman

8mo

5

24 [ASoT] Some thoughts about deceptive mesaoptimization

leogao

8mo

5

7 Inner alignment: what are we pointing at?

lcmgcd

3mo

2

11 Alignment as Game Design

Shoshannah Tekofsky

5mo

7

61 "Inner Alignment Failures" Which Are Actually Outer Alignment Failures

johnswentworth

2y

38

94 List of resolved confusions about IDA

Wei_Dai

3y

18

2 Don't you think RLHF solves outer alignment?

Raphaël S

1mo

19

12 Mesa-utility functions might not be purely proxy goals

Thomas Kwa

8mo

17

45 Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes

2y

7

96 Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout

18d

18

35 Mesa-Optimizers via Grokking

orthonormal

14d

4

26 Take 8: Queer the inner/outer alignment dichotomy.

Charlie Steiner

11d

2

20 Value Formation: An Overarching Model

Thane Ruthenis

1mo

6

72 How likely is deceptive alignment?

evhub

3mo

21

21 Greed Is the Root of This Evil

Thane Ruthenis

2mo

4

36 Broad Picture of Human Values

Thane Ruthenis

4mo

5

43 Outer vs inner misalignment: three framings

Richard_Ngo

5mo

4

8 Is there a demo of "You can't fetch the coffee if you're dead"?

Ram Rachum

1mo

9

103 Selection Theorems: A Program For Understanding Agents

johnswentworth

1y

23

175 Inner Alignment: Explain like I'm 12 Edition

Rafael Harth

2y

46

34 [Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes

8mo

4

27 Clarifying the confusion around inner alignment

Rauno Arike

7mo

0

12 Are Generative World Models a Mesa-Optimization Risk?

Thane Ruthenis

3mo

2