Mesa-Optimization (10 posts)

Karma | Title | Author | Posted | Comments
85 | Risks from Learned Optimization: Conclusion and Related Work | evhub | 3y | 4
38 | Formal Solution to the Inner Alignment Problem | michaelcohen | 1y | 123
148 | Risks from Learned Optimization: Introduction | evhub | 3y | 42
74 | Mesa-Search vs Mesa-Control | abramdemski | 2y | 45
50 | [AN #58] Mesa optimization: what it is, and why we should care | Rohin Shah | 3y | 9
44 | Thoughts on gradient hacking | Richard_Ngo | 1y | 12
56 | Meta learning to gradient hack | Quintin Pope | 1y | 11
58 | Agency As a Natural Abstraction | Thane Ruthenis | 7mo | 9
24 | [ASoT] Some thoughts about deceptive mesaoptimization | leogao | 8mo | 5
75 | Conditions for Mesa-Optimization | evhub | 3y | 48
Outer Alignment (12 posts)

Karma | Title | Author | Posted | Comments
64 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
71 | Human Mimicry Mainly Works When We’re Already Close | johnswentworth | 4mo | 16
6 | Inner alignment: what are we pointing at? | lcmgcd | 3mo | 2
32 | Outer alignment and imitative amplification | evhub | 2y | 11
46 | The Steering Problem | paulfchristiano | 4y | 12
73 | An Increasingly Manipulative Newsfeed | Michaël Trazzi | 3y | 16
78 | "Inner Alignment Failures" Which Are Actually Outer Alignment Failures | johnswentworth | 2y | 38
18 | If I were a well-intentioned AI... III: Extremal Goodhart | Stuart_Armstrong | 2y | 0
20 | If I were a well-intentioned AI... II: Acting in a world | Stuart_Armstrong | 2y | 0
7 | [ASoT] Some thoughts about imperfect world modeling | leogao | 8mo | 0
-1 | Planning capacity and daemons | lcmgcd | 2mo | 0
35 | "Designing agent incentives to avoid reward tampering", DeepMind | gwern | 3y | 15