1125 posts
Tags: AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory

321 posts
Tags: Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
Karma | Title | Author | Posted | Comments
37 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
252 | Reward is not the optimization target | TurnTrout | 4mo | 97
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
34 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
159 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
77 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53
60 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
36 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
121 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17
42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18
49 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2
69 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
47 | Reframing inner alignment | davidad | 9d | 13
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
80 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
35 | Side-channels: input versus output | davidad | 8d | 9
57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
93 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
226 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27