4148 posts: AI, AI Risk, GPT, AI Timelines, Machine Learning (ML), Anthropics, AI Takeoff, Interpretability (ML & AI), Existential Risk, Inner Alignment, Neuroscience, Goodhart's Law
14574 posts: Decision Theory, Utility Functions, Embedded Agency, Value Learning, Suffering, Counterfactuals, Nutrition, Animal Welfare, Newcomb's Problem, Research Agendas, VNM Theorem, Risks of Astronomical Suffering (S-risks)
Karma | Title | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
7 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
29 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
33 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
40 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
43 | Next Level Seinfeld | Zvi | 1d | 6
70 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
10 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
199 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
108 | The next decades might be wild | Marius Hobbhahn | 5d | 21
70 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
61 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
15 | Solution to The Alignment Problem | Algon | 1d | 0
46 | K-complexity is silly; use cross-entropy instead | So8res | 1h | 4
13 | Note on algorithms with multiple trained components | Steven Byrnes | 6h | 1
34 | My AGI safety research—2022 review, '23 plans | Steven Byrnes | 6d | 6
58 | Take 7: You should talk about "the human's utility function" less. | Charlie Steiner | 12d | 22
25 | How can one literally buy time (from x-risk) with money? | Alex_Altair | 7d | 3
71 | When AI solves a game, focus on the game's mechanics, not its theme. | Cleo Nardo | 27d | 7
31 | "Attention Passengers": not for Signs | jefftk | 13d | 10
16 | Using Obsidian if you're used to using Roam | Solenoid_Entity | 9d | 4
138 | Decision theory does not imply that we get to have nice things | So8res | 2mo | 53
14 | Ponzi schemes can be highly profitable if your timing is good | GeneSmith | 8d | 18
17 | Riffing on the agent type | Quinn | 12d | 0
66 | Humans do acausal coordination all the time | Adam Jermyn | 1mo | 36
218 | Reward is not the optimization target | TurnTrout | 4mo | 97
35 | A Short Dialogue on the Meaning of Reward Functions | Leon Lang | 1mo | 0