54 posts
Oracle AI
AI Boxing (Containment)
Acausal Trade
Bounties (closed)
Superrationality
Parables & Fables
Self Fulfilling/Refuting Prophecies
Values handshakes
Verification
Computer Security & Cryptography
26 posts
Myopia
Deceptive Alignment
Deception
35 | Side-channels: input versus output | davidad | 8d | 9
43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
145 | Decision theory does not imply that we get to have nice things | So8res | 2mo | 53
25 | Training goals for large language models | Johannes_Treutlein | 5mo | 5
258 | The Parable of Predict-O-Matic | abramdemski | 3y | 42
65 | Cryptographic Boxes for Unfriendly AI | paulfchristiano | 12y | 162
28 | Breaking Oracles: superrationality and acausal trade | Stuart_Armstrong | 3y | 15
1 | From halting oracles to modal logic | Benya_Fallenstein | 7y | 0
7 | Multibit reflective oracles | Benya_Fallenstein | 7y | 0
7 | Probabilistic Oracle Machines and Nash Equilibria | jessicata | 7y | 0
4 | Non-manipulative oracles | Stuart_Armstrong | 7y | 0
6 | UDT in the Land of Probabilistic Oracles | jessicata | 7y | 0
1 | Oracle machines for automated philosophy | Nisan | 7y | 0
8 | Forum Digest: Reflective Oracles | jessicata | 7y | 0
41 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
64 | How likely is deceptive alignment? | evhub | 3mo | 21
30 | Sticky goals: a concrete experiment for understanding deceptive alignment | evhub | 3mo | 13
119 | Monitoring for deceptive alignment | evhub | 3mo | 7
21 | Understanding and controlling auto-induced distributional shift | LRudL | 1y | 3
8 | Training Trace Priors | Adam Jermyn | 6mo | 17
50 | LCDT, A Myopic Decision Theory | adamShimi | 1y | 51
31 | Random Thoughts on Predict-O-Matic | abramdemski | 3y | 3
63 | Partial Agency | abramdemski | 3y | 18
17 | Framings of Deceptive Alignment | peterbarnett | 7mo | 6
107 | The Credit Assignment Problem | abramdemski | 3y | 40
65 | Why GPT wants to mesa-optimize & how we might change this | John_Maxwell | 2y | 32
44 | Towards a mechanistic understanding of corrigibility | evhub | 3y | 26
41 | Acceptability Verification: A Research Agenda | David Udell | 5mo | 0