80 posts
Oracle AI
Myopia
AI Boxing (Containment)
Deceptive Alignment
Deception
Acausal Trade
Self Fulfilling/Refuting Prophecies
Bounties (closed)
Parables & Fables
Superrationality
Values handshakes
Computer Security & Cryptography
88 posts
Conjecture (org)
Language Models
Refine
Agency
Deconfusion
Scaling Laws
Project Announcement
Encultured AI (org)
Tool AI
Definitions
PaLM
Prompt Engineering
67
Proper scoring rules don’t guarantee predicting fixed points
Johannes_Treutlein
4d
2
35
Side-channels: input versus output
davidad
8d
9
33
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
89
Trying to Make a Treacherous Mesa-Optimizer
MadHatter
1mo
13
139
Decision theory does not imply that we get to have nice things
So8res
2mo
53
117
Monitoring for deceptive alignment
evhub
3mo
7
80
How likely is deceptive alignment?
evhub
3mo
21
40
Sticky goals: a concrete experiment for understanding deceptive alignment
evhub
3mo
13
45
Acceptability Verification: A Research Agenda
David Udell
5mo
0
324
The Parable of Predict-O-Matic
abramdemski
3y
42
27
Training goals for large language models
Johannes_Treutlein
5mo
5
19
Precursor checking for deceptive alignment
evhub
4mo
0
29
Framings of Deceptive Alignment
peterbarnett
7mo
6
27
The Speed + Simplicity Prior is probably anti-deceptive
7mo
29
26
Discovering Language Model Behaviors with Model-Written Evaluations
evhub
4h
3
31
Take 11: "Aligning language models" should be weirder.
Charlie Steiner
2d
0
96
[Interim research report] Taking features out of superposition with sparse autoencoders
Lee Sharkey
7d
10
222
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
223
Conjecture: a retrospective after 8 months of work
Connor Leahy
27d
9
759
Simulators
janus
3mo
103
248
Mysteries of mode collapse
janus
1mo
35
97
Searching for Search
NicholasKees
22d
6
118
Conjecture Second Hiring Round
Connor Leahy
27d
0
98
What I Learned Running Refine
adamShimi
26d
5
123
Current themes in mechanistic interpretability research
Lee Sharkey
1mo
3
494
chinchilla's wild implications
nostalgebraist
4mo
114
190
Interpreting Neural Networks through the Polytope Lens
Sid Black
2mo
26
101
Inverse Scaling Prize: Round 1 Winners
Ethan Perez
2mo
16