80 posts
Tags: Oracle AI, Myopia, AI Boxing (Containment), Deceptive Alignment, Deception, Acausal Trade, Self Fulfilling/Refuting Prophecies, Bounties (closed), Parables & Fables, Superrationality, Values handshakes, Computer Security & Cryptography

88 posts
Tags: Conjecture (org), Language Models, Refine, Agency, Deconfusion, Scaling Laws, Project Announcement, Encultured AI (org), Tool AI, Definitions, PaLM, Prompt Engineering
Score · Title · Author · Posted · Comments
35 · Side-channels: input versus output · davidad · 8d · 9
43 · Proper scoring rules don’t guarantee predicting fixed points · Johannes_Treutlein · 4d · 2
41 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
145 · Decision theory does not imply that we get to have nice things · So8res · 2mo · 53
64 · How likely is deceptive alignment? · evhub · 3mo · 21
30 · Sticky goals: a concrete experiment for understanding deceptive alignment · evhub · 3mo · 13
25 · Training goals for large language models · Johannes_Treutlein · 5mo · 5
119 · Monitoring for deceptive alignment · evhub · 3mo · 7
21 · Understanding and controlling auto-induced distributional shift · LRudL · 1y · 3
8 · Training Trace Priors · Adam Jermyn · 6mo · 17
50 · LCDT, A Myopic Decision Theory · adamShimi · 1y · 51
258 · The Parable of Predict-O-Matic · abramdemski · 3y · 42
31 · Random Thoughts on Predict-O-Matic · abramdemski · 3y · 3
65 · Cryptographic Boxes for Unfriendly AI · paulfchristiano · 12y · 162
Score · Title · Author · Posted · Comments
28 · Discovering Language Model Behaviors with Model-Written Evaluations · evhub · 4h · 3
64 · [Interim research report] Taking features out of superposition with sparse autoencoders · Lee Sharkey · 7d · 10
96 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
185 · Simulators · janus · 3mo · 103
178 · Mysteries of mode collapse · janus · 1mo · 35
28 · Inverse scaling can become U-shaped · Edouard Harris · 1mo · 15
143 · Conjecture: a retrospective after 8 months of work · Connor Leahy · 27d · 9
108 · What I Learned Running Refine · adamShimi · 26d · 5
40 · Paper: Large Language Models Can Self-improve [Linkpost] · Evan R. Murphy · 2mo · 14
163 · Language models seem to be much better than humans at next-token prediction · Buck · 4mo · 56
75 · Inverse Scaling Prize: Round 1 Winners · Ethan Perez · 2mo · 16
45 · Smoke without fire is scary · Adam Jermyn · 2mo · 22
56 · Interpreting Neural Networks through the Polytope Lens · Sid Black · 2mo · 26
38 · Beware over-use of the agent model · Alex Flint · 1y · 10