Tags (1125 posts): AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory

Tags (321 posts): Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 25 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 233 | Reward is not the optimization target | TurnTrout | 4mo | 97 |
| 35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 35 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6 |
| 136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 82 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53 |
| 53 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61 |
| 47 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14 |
| 113 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17 |
| 42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18 |
| 54 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2 |
| 51 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20 |
| 11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 59 | Reframing inner alignment | davidad | 9d | 13 |
| 114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 32 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 183 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 223 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 64 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10 |
| 35 | Side-channels: input versus output | davidad | 8d | 9 |
| 61 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 90 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13 |
| 43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 199 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138 |
| 96 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |