Tree of Tags

17 posts Myopia

9 posts Deceptive Alignment Deception

Karma | Title | Author | Age | Comments
33 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
45 | Acceptability Verification: A Research Agenda | David Udell | 5mo | 0
50 | LCDT, A Myopic Decision Theory | adamShimi | 1y | 51
31 | Understanding and controlling auto-induced distributional shift | LRudL | 1y | 3
57 | Open Problems with Myopia | Mark Xu | 1y | 16
75 | The Credit Assignment Problem | abramdemski | 3y | 40
53 | AI safety via market making | evhub | 2y | 45
45 | Why GPT wants to mesa-optimize & how we might change this | John_Maxwell | 2y | 32
45 | Arguments against myopic training | Richard_Ngo | 2y | 39
53 | Partial Agency | abramdemski | 3y | 18
20 | Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability | Michaël Trazzi | 1y | 0
44 | Towards a mechanistic understanding of corrigibility | evhub | 3y | 26
31 | Random Thoughts on Predict-O-Matic | abramdemski | 3y | 3
27 | Bayesian Evolving-to-Extinction | abramdemski | 2y | 13
89 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13
117 | Monitoring for deceptive alignment | evhub | 3mo | 7
80 | How likely is deceptive alignment? | evhub | 3mo | 21
40 | Sticky goals: a concrete experiment for understanding deceptive alignment | evhub | 3mo | 13
19 | Precursor checking for deceptive alignment | evhub | 4mo | 0
29 | Framings of Deceptive Alignment | peterbarnett | 7mo | 6
27 | The Speed + Simplicity Prior is probably anti-deceptive |  | 7mo | 29
16 | Training Trace Priors | Adam Jermyn | 6mo | 17
39 | Will transparency help catch deception? Perhaps not | Matthew Barnett | 3y | 5