1564 posts — AI, Inner Alignment, Interpretability (ML & AI), AI Timelines, GPT, Research Agendas, AI Takeoff, Value Learning, Machine Learning (ML), Conjecture (org), Mesa-Optimization, Outer Alignment
349 posts — Abstraction, Impact Regularization, Rationality, World Modeling, Decision Theory, Human Values, Goal-Directedness, Anthropics, Utility Functions, Finite Factored Sets, Shard Theory, Fixed Point Theorems
Karma | Title | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
10 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
60 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
70 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9
148 | Finite Factored Sets in Pictures | Magdalena Wache | 9d | 29
42 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9
47 | Take 7: You should talk about "the human's utility function" less. | Charlie Steiner | 12d | 22
777 | Where I agree and disagree with Eliezer | paulfchristiano | 6mo | 205
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
43 | Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide | Andrew_Critch | 26d | 34
310 | Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover | Ajeya Cotra | 5mo | 89
202 | The shard theory of human values | Quintin Pope | 3mo | 57
84 | Contra shard theory, in the context of the diamond maximizer problem | So8res | 2mo | 16
56 | «Boundaries», Part 3a: Defining boundaries as directed Markov blankets | Andrew_Critch | 1mo | 13
51 | Humans do acausal coordination all the time | Adam Jermyn | 1mo | 36
175 | Humans provide an untapped wealth of evidence about alignment | TurnTrout | 5mo | 92
29 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8