Tree of Tags

30 posts Outer Alignment Mesa-Optimization

46 posts Inner Alignment

59 Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC

4d

10

5 Don't you think RLHF solves outer alignment?

Raphaël S

1mo

19

93 Trying to Make a Treacherous Mesa-Optimizer

MadHatter

1mo

13

6 How much should we worry about mesa-optimization challenges?

sudo -i

4mo

13

15 Alignment as Game Design

Shoshannah Tekofsky

5mo

7

5 Do mesa-optimization problems correlate with low-slack?

sudo -i

10mo

1

8 Inner alignment: what are we pointing at?

lcmgcd

3mo

2

52 Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes

2y

7

17 Outer alignment and imitative amplification

evhub

2y

11

42 The Steering Problem

paulfchristiano

4y

12

19 Is the Star Trek Federation really incapable of building AI?

Kaj_Sotala

4y

4

4 Alignment via manually implementing the utility function

Chantiel

1y

6

42 Weak arguments against the universal prior being malign

X4vier

4y

23

53 An Increasingly Manipulative Newsfeed

Michaël Trazzi

3y

16

108 Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout

18d

18

21 Value Formation: An Overarching Model

Thane Ruthenis

1mo

6

29 Mesa-Optimizers via Grokking

orthonormal

14d

4

24 Take 8: Queer the inner/outer alignment dichotomy.

Charlie Steiner

11d

2

20 Is there a demo of "You can't fetch the coffee if you're dead"?

Ram Rachum

1mo

9

84 How likely is deceptive alignment?

evhub

3mo

21

80 2-D Robustness

vlad_m

3y

8

185 Inner Alignment: Explain like I'm 12 Edition

Rafael Harth

2y

46

20 Greed Is the Root of This Evil

Thane Ruthenis

2mo

4

36 Broad Picture of Human Values

Thane Ruthenis

4mo

5

8 Doom doubts - is inner alignment a likely problem?

Crissman

5mo

7

44 Outer vs inner misalignment: three framings

Richard_Ngo

5mo

4

63 Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr

1y

7

112 Selection Theorems: A Program For Understanding Agents

johnswentworth

1y

23