Tree of Tags

30 posts Outer Alignment Mesa-Optimization

46 posts Inner Alignment

166 Risks from Learned Optimization: Introduction

evhub

3y

42

94 List of resolved confusions about IDA

Wei_Dai

3y

18

87 Trying to Make a Treacherous Mesa-Optimizer

MadHatter

1mo

13

62 An Increasingly Manipulative Newsfeed

Michaël Trazzi

3y

16

61 "Inner Alignment Failures" Which Are Actually Outer Alignment Failures

johnswentworth

2y

38

60 Prize for probable problems

paulfchristiano

4y

63

60 Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC

4d

10

54 [AN #58] Mesa optimization: what it is, and why we should care

Rohin Shah

3y

9

50 Weak arguments against the universal prior being malign

X4vier

4y

23

45 Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes

2y

7

43 The Steering Problem

paulfchristiano

4y

12

30 The Speed + Simplicity Prior is probably anti-deceptive

7mo

29

24 Outer alignment and imitative amplification

evhub

2y

11

24 [ASoT] Some thoughts about deceptive mesaoptimization

leogao

8mo

5

175 Inner Alignment: Explain like I'm 12 Edition

Rafael Harth

2y

46

103 Selection Theorems: A Program For Understanding Agents

johnswentworth

1y

23

103 Demons in Imperfect Search

johnswentworth

2y

21

99 The Inner Alignment Problem

evhub

3y

17

96 Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout

18d

18

87 Tessellating Hills: a toy model for demons in imperfect search

DaemonicSigil

2y

17

81 Open question: are minimal circuits daemon-free?

paulfchristiano

4y

70

77 2-D Robustness

vlad_m

3y

8

72 How likely is deceptive alignment?

evhub

3mo

21

70 A simple environment for showing mesa misalignment

Matthew Barnett

3y

9

70 Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr

1y

7

63 Empirical Observations of Objective Robustness Failures

jbkjr

1y

5

63 Concrete experiments in inner alignment

evhub

3y

12

54 Mesa-Search vs Mesa-Control

abramdemski

2y

45