Tags — 1913 posts: AI, World Modeling, Inner Alignment, Rationality, Interpretability (ML & AI), AI Timelines, Decision Theory, GPT, Research Agendas, Abstraction, Value Learning, Impact Regularization
Tags — 855 posts: Logical Induction, Threat Models, Goodhart's Law, Practice & Philosophy of Science, Logical Uncertainty, Intellectual Progress (Society-Level), Radical Probabilism, Epistemology, Ethics & Morality, Software Tools, Fiction, Bayes' Theorem
Karma | Title | Author | Posted | Comments
15 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
62 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9
26 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
35 | Reframing inner alignment | davidad | 9d | 13
38 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9
79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
271 | Reward is not the optimization target | TurnTrout | 4mo | 97
39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
102 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
132 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
20 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
33 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
189 | The next decades might be wild | Marius Hobbhahn | 5d | 21
56 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
141 | Worlds Where Iterative Design Fails | johnswentworth | 3mo | 26
108 | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86
82 | Thoughts on AGI organizations and capabilities work | Rob Bensinger | 13d | 17
429 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122
44 | AI X-risk >35% mostly based on a recent peer-reviewed argument | michaelcohen | 1mo | 31
69 | Deconfusing Direct vs Amortised Optimization | beren | 18d | 6
9 | Corrigibility Via Thought-Process Deference | Thane Ruthenis | 26d | 5
67 | We may be able to see sharp left turns coming | Ethan Perez | 3mo | 26
55 | Methodological Therapy: An Agenda For Tackling Research Bottlenecks | adamShimi | 2mo | 6
80 | Don't leave your fingerprints on the future | So8res | 2mo | 32
91 | Oversight Misses 100% of Thoughts The AI Does Not Think | johnswentworth | 4mo | 49
144 | Your posts should be on arXiv | JanBrauner | 3mo | 39