246 posts
Interpretability (ML & AI)
Instrumental Convergence
Corrigibility
Myopia
Deconfusion
Self Fulfilling/Refuting Prophecies
Orthogonality Thesis
AI Success Models
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
112 posts
Iterated Amplification
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Experiments
Ought
AI-assisted Alignment
Memory and Mnemonics
Air Conditioning
Verification
Posts (karma · title · author · posted · comments):
16 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
140 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
21 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
250 · The Plan - 2022 Update · johnswentworth · 19d · 33
26 · The limited upside of interpretability · Peter S. Park · 1mo · 11
59 · You can still fetch the coffee today if you're dead tomorrow · davidad · 11d · 15
235 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
34 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
56 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
34 · Empowerment is (almost) All We Need · jacob_cannell · 1mo · 43
26 · People care about each other even though they have imperfect motivational pointers? · TurnTrout · 1mo · 25
27 · Applications for Deconfusing Goal-Directedness · adamShimi · 1y · 3
78 · By Default, GPTs Think In Plain Sight · Fabien Roger · 1mo · 16
0 · The Opportunity and Risks of Learning Human Values In-Context · Zachary Robertson · 10d · 4
26 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · Charlie Steiner · 8d · 14
184 · Godzilla Strategies · johnswentworth · 6mo · 65
44 · Notes on OpenAI’s alignment plan · Alex Flint · 12d · 5
11 · Alignment with argument-networks and assessment-predictions · Tor Økland Barstad · 7d · 3
21 · Research request (alignment strategy): Deep dive on "making AI solve alignment for us" · JanBrauner · 19d · 3
62 · Rant on Problem Factorization for Alignment · johnswentworth · 4mo · 48
61 · Relaxed adversarial training for inner alignment · evhub · 3y · 28
17 · Provably Honest - A First Step · Srijanak De · 1mo · 2
118 · Debate update: Obfuscated arguments problem · Beth Barnes · 1y · 21
20 · AI Safety via Debate · ESRogs · 4y · 13
13 · Getting from an unaligned AGI to an aligned AGI? · Tor Økland Barstad · 6mo · 7
80 · Air Conditioner Test Results & Discussion · johnswentworth · 6mo · 38
12 · AI-assisted list of ten concrete alignment things to do right now · lcmgcd · 3mo · 5
139 · Paul's research agenda FAQ · zhukeepa · 4y · 73