Tags (1125 posts): AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory
Tags (321 posts): Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 7 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 12 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 53 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 13 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2 |
| 154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9 |
| 44 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2 |
| 49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 29 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3 |
| 33 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6 |
| 26 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 15 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 132 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 67 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 31 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0 |
| 307 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 96 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10 |
| 70 | Predicting GPU performance | Marius Hobbhahn | 6d | 24 |
| 239 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 222 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |
| 223 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9 |
| 142 | Re-Examining LayerNorm | Eric Winsor | 19d | 8 |
| 96 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13 |
| 33 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2 |