1125 posts: AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory
321 posts: Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
Karma | Title | Author | Posted | Comments
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
10 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
18 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
159 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
42 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
49 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2
130 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
34 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
15 | Looking for an alignment tutor | JanBrauner | 3d | 2
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
80 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
59 | Predicting GPU performance | Marius Hobbhahn | 6d | 24
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
93 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
183 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9
47 | Reframing inner alignment | davidad | 9d | 13