80 posts
Tags: Oracle AI, Myopia, AI Boxing (Containment), Deceptive Alignment, Deception, Acausal Trade, Self Fulfilling/Refuting Prophecies, Bounties (closed), Parables & Fables, Superrationality, Values handshakes, Computer Security & Cryptography

88 posts
Tags: Conjecture (org), Language Models, Refine, Agency, Deconfusion, Scaling Laws, Project Announcement, Encultured AI (org), Tool AI, Definitions, PaLM, Prompt Engineering
Score · Title · Author · Posted · Comments
35 · Side-channels: input versus output · davidad · 8d · 9
43 · Proper scoring rules don’t guarantee predicting fixed points · Johannes_Treutlein · 4d · 2
41 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
145 · Decision theory does not imply that we get to have nice things · So8res · 2mo · 53
64 · How likely is deceptive alignment? · evhub · 3mo · 21
30 · Sticky goals: a concrete experiment for understanding deceptive alignment · evhub · 3mo · 13
25 · Training goals for large language models · Johannes_Treutlein · 5mo · 5
119 · Monitoring for deceptive alignment · evhub · 3mo · 7
21 · Understanding and controlling auto-induced distributional shift · LRudL · 1y · 3
8 · Training Trace Priors · Adam Jermyn · 6mo · 17
50 · LCDT, A Myopic Decision Theory · adamShimi · 1y · 51
258 · The Parable of Predict-O-Matic · abramdemski · 3y · 42
31 · Random Thoughts on Predict-O-Matic · abramdemski · 3y · 3
65 · Cryptographic Boxes for Unfriendly AI · paulfchristiano · 12y · 162
Score · Title · Author · Posted · Comments
28 · Discovering Language Model Behaviors with Model-Written Evaluations · evhub · 4h · 3
64 · [Interim research report] Taking features out of superposition with sparse autoencoders · Lee Sharkey · 7d · 10
96 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
185 · Simulators · janus · 3mo · 103
178 · Mysteries of mode collapse · janus · 1mo · 35
28 · Inverse scaling can become U-shaped · Edouard Harris · 1mo · 15
143 · Conjecture: a retrospective after 8 months of work · Connor Leahy · 27d · 9
108 · What I Learned Running Refine · adamShimi · 26d · 5
40 · Paper: Large Language Models Can Self-improve [Linkpost] · Evan R. Murphy · 2mo · 14
163 · Language models seem to be much better than humans at next-token prediction · Buck · 4mo · 56
75 · Inverse Scaling Prize: Round 1 Winners · Ethan Perez · 2mo · 16
45 · Smoke without fire is scary · Adam Jermyn · 2mo · 22
56 · Interpreting Neural Networks through the Polytope Lens · Sid Black · 2mo · 26
38 · Beware over-use of the agent model · Alex Flint · 1y · 10