| Karma | Title | Author | Posted | Comments |
|---:|---|---|---|---:|
| 27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 70 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9 |
| 62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 10 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 60 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10 |
| 148 | Finite Factored Sets in Pictures | Magdalena Wache | 9d | 29 |
| 42 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9 |
| 55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 155 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 39 | AI Neorealism: a threat model & success criterion for existential safety | davidad | 5d | 0 |
| 94 | Thoughts on AGI organizations and capabilities work | Rob Bensinger | 13d | 17 |
| 124 | Logical induction for software engineers | Alex Flint | 17d | 2 |
| 58 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15 |
| 336 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122 |
| 31 | Reflections on the PIBBSS Fellowship 2022 | Nora_Ammann | 9d | 0 |
| 103 | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86 |
| 48 | Deconfusing Direct vs Amortised Optimization | beren | 18d | 6 |
| 724 | AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 6mo | 653 |
| 207 | Lessons learned from talking to >100 academics about AI safety | Marius Hobbhahn | 2mo | 16 |
| 36 | Refining the Sharp Left Turn threat model, part 2: applying alignment techniques | Vika | 25d | 4 |
| 161 | Most People Start With The Same Few Bad Ideas | johnswentworth | 3mo | 30 |
| 98 | Niceness is unnatural | So8res | 2mo | 18 |