Tree of Tags

17 posts Myopia

9 posts Deceptive Alignment Deception

41 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
21 · Understanding and controlling auto-induced distributional shift · LRudL · 1y · 3
50 · LCDT, A Myopic Decision Theory · adamShimi · 1y · 51
31 · Random Thoughts on Predict-O-Matic · abramdemski · 3y · 3
63 · Partial Agency · abramdemski · 3y · 18
107 · The Credit Assignment Problem · abramdemski · 3y · 40
65 · Why GPT wants to mesa-optimize & how we might change this · John_Maxwell · 2y · 32
44 · Towards a mechanistic understanding of corrigibility · evhub · 3y · 26
41 · Acceptability Verification: A Research Agenda · David Udell · 5mo · 0
67 · Arguments against myopic training · Richard_Ngo · 2y · 39
57 · Open Problems with Myopia · Mark Xu · 1y · 16
36 · Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability · Michaël Trazzi · 1y · 0
24 · The Dualist Predict-O-Matic ($100 prize) · John_Maxwell · 3y · 35
57 · AI safety via market making · evhub · 2y · 45
64 · How likely is deceptive alignment? · evhub · 3mo · 21
30 · Sticky goals: a concrete experiment for understanding deceptive alignment · evhub · 3mo · 13
119 · Monitoring for deceptive alignment · evhub · 3mo · 7
8 · Training Trace Priors · Adam Jermyn · 6mo · 17
17 · Framings of Deceptive Alignment · peterbarnett · 7mo · 6
47 · Will transparency help catch deception? Perhaps not · Matthew Barnett · 3y · 5
33 · The Speed + Simplicity Prior is probably anti-deceptive · 7mo · 29
85 · Trying to Make a Treacherous Mesa-Optimizer · MadHatter · 1mo · 13
17 · Precursor checking for deceptive alignment · evhub · 4mo · 0