Tree of Tags

17 posts Myopia

9 posts Deceptive Alignment Deception

33 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

31 Understanding and controlling auto-induced distributional shift

LRudL

1y

3

50 LCDT, A Myopic Decision Theory

adamShimi

1y

51

31 Random Thoughts on Predict-O-Matic

abramdemski

3y

3

53 Partial Agency

abramdemski

3y

18

75 The Credit Assignment Problem

abramdemski

3y

40

45 Why GPT wants to mesa-optimize & how we might change this

John_Maxwell

2y

32

44 Towards a mechanistic understanding of corrigibility

evhub

3y

26

45 Acceptability Verification: A Research Agenda

David Udell

5mo

0

45 Arguments against myopic training

Richard_Ngo

2y

39

57 Open Problems with Myopia

Mark Xu

1y

16

20 Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability

Michaël Trazzi

1y

0

8 The Dualist Predict-O-Matic ($100 prize)

John_Maxwell

3y

35

53 AI safety via market making

evhub

2y

45

80 How likely is deceptive alignment?

evhub

3mo

21

40 Sticky goals: a concrete experiment for understanding deceptive alignment

evhub

3mo

13

117 Monitoring for deceptive alignment

evhub

3mo

7

16 Training Trace Priors

Adam Jermyn

6mo

17

29 Framings of Deceptive Alignment

peterbarnett

7mo

6

39 Will transparency help catch deception? Perhaps not

Matthew Barnett

3y

5

27 The Speed + Simplicity Prior is probably anti-deceptive

7mo

29

89 Trying to Make a Treacherous Mesa-Optimizer

MadHatter

1mo

13

19 Precursor checking for deceptive alignment

evhub

4mo

0