Tree of Tags

54 posts: Oracle AI, AI Boxing (Containment), Acausal Trade, Bounties (closed), Superrationality, Parables & Fables, Self Fulfilling/Refuting Prophecies, Values handshakes, Verification, Computer Security & Cryptography

26 posts: Myopia, Deceptive Alignment, Deception

55 Proper scoring rules don’t guarantee predicting fixed points

Johannes_Treutlein

4d

2

35 Side-channels: input versus output

davidad

8d

9

142 Decision theory does not imply that we get to have nice things

So8res

2mo

53

291 The Parable of Predict-O-Matic

abramdemski

3y

42

26 Training goals for large language models

Johannes_Treutlein

5mo

5

41 Superrational Agents Kelly Bet Influence!

abramdemski

1y

5

58 Results of $1,000 Oracle contest!

Stuart_Armstrong

2y

2

57 Contest: $1,000 for good questions to ask to an Oracle AI

Stuart_Armstrong

3y

156

47 Counterfactual Oracles = online supervised learning with random selection of training episodes

Wei_Dai

3y

26

60 Prize for probable problems

paulfchristiano

4y

63

29 Self-Supervised Learning and AGI Safety

Steven Byrnes

3y

9

25 Breaking Oracles: superrationality and acausal trade

Stuart_Armstrong

3y

15

22 Analysing: Dangerous messages from future UFAI via Oracles

Stuart_Armstrong

3y

16

20 Oracles: reject all deals - break superrationality, with superrationality

Stuart_Armstrong

3y

4

37 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

87 Trying to Make a Treacherous Mesa-Optimizer

MadHatter

1mo

13

118 Monitoring for deceptive alignment

evhub

3mo

7

72 How likely is deceptive alignment?

evhub

3mo

21

35 Sticky goals: a concrete experiment for understanding deceptive alignment

evhub

3mo

13

43 Acceptability Verification: A Research Agenda

David Udell

5mo

0

18 Precursor checking for deceptive alignment

evhub

4mo

0

30 The Speed + Simplicity Prior is probably anti-deceptive

7mo

29

23 Framings of Deceptive Alignment

peterbarnett

7mo

6

50 LCDT, A Myopic Decision Theory

adamShimi

1y

51

57 Open Problems with Myopia

Mark Xu

1y

16

26 Understanding and controlling auto-induced distributional shift

LRudL

1y

3

12 Training Trace Priors

Adam Jermyn

6mo

17

91 The Credit Assignment Problem

abramdemski

3y

40