1564 posts
AI
Inner Alignment
Interpretability (ML & AI)
AI Timelines
GPT
Research Agendas
AI Takeoff
Value Learning
Machine Learning (ML)
Conjecture (org)
Mesa-Optimization
Outer Alignment
349 posts
Abstraction
Impact Regularization
Rationality
World Modeling
Decision Theory
Human Values
Goal-Directedness
Anthropics
Utility Functions
Finite Factored Sets
Shard Theory
Fixed Point Theorems
11
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
28
Discovering Language Model Behaviors with Model-Written Evaluations
evhub
4h
3
25
Existential AI Safety is NOT separate from near-term applications
scasper
7d
15
59
Reframing inner alignment
davidad
9d
13
45
Towards Hodge-podge Alignment
Cleo Nardo
1d
20
233
Reward is not the optimization target
TurnTrout
4mo
97
35
The "Minimal Latents" Approach to Natural Abstractions
johnswentworth
22h
6
90
Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout
18d
18
99
Trying to disambiguate different questions about whether RLHF is “good”
Buck
6d
39
114
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
32
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
35
My AGI safety research—2022 review, ’23 plans
Steven Byrnes
6d
6
42
Don't align agents to evaluations of plans
TurnTrout
24d
46
183
The Plan - 2022 Update
johnswentworth
19d
33
78
Shard Theory in Nine Theses: a Distillation and Critical Appraisal
LawrenceC
1d
9
46
Positive values seem more robust and lasting than prohibitions
TurnTrout
3d
9
125
Finite Factored Sets in Pictures
Magdalena Wache
9d
29
58
Alignment allows "nonrobust" decision-influences and doesn't require robust grading
TurnTrout
21d
27
48
Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide
Andrew_Critch
26d
34
93
wrapper-minds are the enemy
nostalgebraist
6mo
36
26
Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight
Jacy Reese Anthis
1mo
8
155
The shard theory of human values
Quintin Pope
3mo
57
28
Traps of Formalization in Deconfusion
adamShimi
1y
7
159
Humans provide an untapped wealth of evidence about alignment
TurnTrout
5mo
92
75
«Boundaries», Part 3a: Defining boundaries as directed Markov blankets
Andrew_Critch
1mo
13
92
Human values & biases are inaccessible to the genome
TurnTrout
5mo
51
78
Builder/Breaker for Deconfusion
abramdemski
2mo
9
239
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Ajeya Cotra
5mo
89