Tree of Tags

80 posts: Oracle AI, Myopia, AI Boxing (Containment), Deceptive Alignment, Deception, Acausal Trade, Self Fulfilling/Refuting Prophecies, Bounties (closed), Parables & Fables, Superrationality, Values handshakes, Computer Security & Cryptography

88 posts: Conjecture (org), Language Models, Refine, Agency, Deconfusion, Scaling Laws, Project Announcement, Encultured AI (org), Tool AI, Definitions, PaLM, Prompt Engineering

67 Proper scoring rules don’t guarantee predicting fixed points

Johannes_Treutlein

4d

2

35 Side-channels: input versus output

davidad

8d

9

33 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

89 Trying to Make a Treacherous Mesa-Optimizer

MadHatter

1mo

13

139 Decision theory does not imply that we get to have nice things

So8res

2mo

53

117 Monitoring for deceptive alignment

evhub

3mo

7

80 How likely is deceptive alignment?

evhub

3mo

21

40 Sticky goals: a concrete experiment for understanding deceptive alignment

evhub

3mo

13

45 Acceptability Verification: A Research Agenda

David Udell

5mo

0

324 The Parable of Predict-O-Matic

abramdemski

3y

42

27 Training goals for large language models

Johannes_Treutlein

5mo

5

19 Precursor checking for deceptive alignment

evhub

4mo

0

29 Framings of Deceptive Alignment

peterbarnett

7mo

6

27 The Speed + Simplicity Prior is probably anti-deceptive

7mo

29

26 Discovering Language Model Behaviors with Model-Written Evaluations

evhub

4h

3

31 Take 11: "Aligning language models" should be weirder.

Charlie Steiner

2d

0

96 [Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey

7d

10

222 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

223 Conjecture: a retrospective after 8 months of work

Connor Leahy

27d

9

759 Simulators

janus

3mo

103

248 Mysteries of mode collapse

janus

1mo

35

97 Searching for Search

NicholasKees

22d

6

118 Conjecture Second Hiring Round

Connor Leahy

27d

0

98 What I Learned Running Refine

adamShimi

26d

5

123 Current themes in mechanistic interpretability research

Lee Sharkey

1mo

3

494 chinchilla's wild implications

nostalgebraist

4mo

114

190 Interpreting Neural Networks through the Polytope Lens

Sid Black

2mo

26

101 Inverse Scaling Prize: Round 1 Winners

Ethan Perez

2mo

16