AI (1446 posts): Interpretability (ML & AI), AI Timelines, GPT, Research Agendas, Value Learning, AI Takeoff, Conjecture (org), Embedded Agency, Machine Learning (ML), Eliciting Latent Knowledge (ELK)
Community (118 posts): Inner Alignment, Optimization, Solomonoff Induction, Predictive Processing, Selection vs Control, Neocortex, Mesa-Optimization, Neuroscience, Priors, AI Services (CAIS), Occam's Razor, General Intelligence
Karma | Title | Author | Posted | Comments
28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
27 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
55 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
64 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
90 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
75 | My take on Jacob Cannell’s take on AGI safety | Steven Byrnes | 22d | 13
42 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
29 | Take 8: Queer the inner/outer alignment dichotomy. | Charlie Steiner | 11d | 2
42 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
20 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
45 | Threat Model Literature Review | zac_kenton | 1mo | 4
59 | Humans aren't fitness maximizers | So8res | 2mo | 45
95 | What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? | johnswentworth | 4mo | 15
20 | Value Formation: An Overarching Model | Thane Ruthenis | 1mo | 6
71 | Human Mimicry Mainly Works When We’re Already Close | johnswentworth | 4mo | 16
79 | Externalized reasoning oversight: a research direction for language model alignment | tamera | 4mo | 22
23 | Greed Is the Root of This Evil | Thane Ruthenis | 2mo | 4