1913 posts: AI, World Modeling, Inner Alignment, Rationality, Interpretability (ML & AI), AI Timelines, Decision Theory, GPT, Research Agendas, Abstraction, Value Learning, Impact Regularization

855 posts: Logical Induction, Threat Models, Goodhart's Law, Practice & Philosophy of Science, Logical Uncertainty, Intellectual Progress (Society-Level), Radical Probabilism, Epistemology, Ethics & Morality, Software Tools, Fiction, Bayes' Theorem
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 78 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9 |
| 13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 64 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10 |
| 46 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9 |
| 99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2 |
| 121 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 106 | Thoughts on AGI organizations and capabilities work | Rob Bensinger | 13d | 17 |
| 38 | AI Neorealism: a threat model & success criterion for existential safety | davidad | 5d | 0 |
| 116 | Logical induction for software engineers | Alex Flint | 17d | 2 |
| 60 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15 |
| 98 | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86 |
| 243 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122 |
| 20 | Reflections on the PIBBSS Fellowship 2022 | Nora_Ammann | 9d | 0 |
| 149 | Lessons learned from talking to >100 academics about AI safety | Marius Hobbhahn | 2mo | 16 |
| 462 | AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 6mo | 653 |
| 42 | Refining the Sharp Left Turn threat model, part 2: applying alignment techniques | Vika | 25d | 4 |
| 27 | Deconfusing Direct vs Amortised Optimization | beren | 18d | 6 |
| 113 | Niceness is unnatural | So8res | 2mo | 18 |
| 164 | Most People Start With The Same Few Bad Ideas | johnswentworth | 3mo | 30 |