1446 posts
AI
Interpretability (ML & AI)
AI Timelines
GPT
Research Agendas
Value Learning
AI Takeoff
Conjecture (org)
Embedded Agency
Machine Learning (ML)
Eliciting Latent Knowledge (ELK)
Community
118 posts
Inner Alignment
Optimization
Solomonoff Induction
Predictive Processing
Selection vs Control
Neocortex
Mesa-Optimization
Neuroscience
Priors
AI Services (CAIS)
Occam's Razor
General Intelligence
13
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
27
Discovering Language Model Behaviors with Model-Written Evaluations
evhub
4h
3
37
Existential AI Safety is NOT separate from near-term applications
scasper
7d
15
47
Reframing inner alignment
davidad
9d
13
62
Towards Hodge-podge Alignment
Cleo Nardo
1d
20
252
Reward is not the optimization target
TurnTrout
4mo
97
37
The "Minimal Latents" Approach to Natural Abstractions
johnswentworth
22h
6
92
Trying to disambiguate different questions about whether RLHF is “good”
Buck
6d
39
123
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
26
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
34
My AGI safety research—2022 review, ’23 plans
Steven Byrnes
6d
6
211
The Plan - 2022 Update
johnswentworth
19d
33
159
Using GPT-Eliezer against ChatGPT Jailbreaking
Stuart_Armstrong
14d
77
265
A challenge for AGI organizations, and a challenge for readers
Rob Bensinger
19d
30
96
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout
18d
18
37
Don't align agents to evaluations of plans
TurnTrout
24d
46
20
Value Formation: An Overarching Model
Thane Ruthenis
1mo
6
60
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
LawrenceC
4d
10
36
Applications for Deconfusing Goal-Directedness
adamShimi
1y
3
61
My take on Jacob Cannell’s take on AGI safety
Steven Byrnes
22d
13
14
Take 6: CAIS is actually Orwellian.
Charlie Steiner
13d
5
35
Mesa-Optimizers via Grokking
orthonormal
14d
4
55
Threat Model Literature Review
zac_kenton
1mo
4
103
Externalized reasoning oversight: a research direction for language model alignment
tamera
4mo
22
37
Framing AI Childhoods
David Udell
3mo
8
103
What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
johnswentworth
4mo
15
52
Humans aren't fitness maximizers
So8res
2mo
45
68
Human Mimicry Mainly Works When We’re Already Close
johnswentworth
4mo
16