Tree of Tags

17 posts Myopia

41 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
41 | Acceptability Verification: A Research Agenda | David Udell | 5mo | 0
50 | LCDT, A Myopic Decision Theory | adamShimi | 1y | 51
57 | Open Problems with Myopia | Mark Xu | 1y | 16
107 | The Credit Assignment Problem | abramdemski | 3y | 40
65 | Why GPT wants to mesa-optimize & how we might change this | John_Maxwell | 2y | 32
67 | Arguments against myopic training | Richard_Ngo | 2y | 39
36 | Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability | Michaël Trazzi | 1y | 0
21 | Understanding and controlling auto-induced distributional shift | LRudL | 1y | 3
57 | AI safety via market making | evhub | 2y | 45
63 | Partial Agency | abramdemski | 3y | 18
49 | Bayesian Evolving-to-Extinction | abramdemski | 2y | 13
44 | Towards a mechanistic understanding of corrigibility | evhub | 3y | 26
37 | Defining Myopia | abramdemski | 3y | 18

9 posts Deceptive Alignment Deception

85 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13
119 | Monitoring for deceptive alignment | evhub | 3mo | 7
64 | How likely is deceptive alignment? | evhub | 3mo | 21
30 | Sticky goals: a concrete experiment for understanding deceptive alignment | evhub | 3mo | 13
33 | The Speed + Simplicity Prior is probably anti-deceptive | | 7mo | 29
17 | Precursor checking for deceptive alignment | evhub | 4mo | 0
17 | Framings of Deceptive Alignment | peterbarnett | 7mo | 6
8 | Training Trace Priors | Adam Jermyn | 6mo | 17
47 | Will transparency help catch deception? Perhaps not | Matthew Barnett | 3y | 5