1125 posts
Tags: AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory

321 posts
Tags: Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
Karma | Title | Author | Posted | Comments
13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
55 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
54 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2
136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
17 | Looking for an alignment tutor | JanBrauner | 3d | 2
35 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
47 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
27 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
223 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
64 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
183 | The Plan - 2022 Update | johnswentworth | 19d | 33
48 | Predicting GPU performance | Marius Hobbhahn | 6d | 24
32 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
59 | Reframing inner alignment | davidad | 9d | 13
90 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
143 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9
35 | Side-channels: input versus output | davidad | 8d | 9