3083 posts: AI, GPT, AI Timelines, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Language Models, Conjecture (org), Careers, Instrumental Convergence, Iterated Amplification, Art
763 posts: Anthropics, Existential Risk, Whole Brain Emulation, Sleeping Beauty Paradox, Threat Models, Academic Papers, Space Exploration & Colonization, Great Filter, Paradoxes, Extraterrestrial Life, Pascal's Mugging, Longtermism
Karma | Title | Author | Posted | Comments
27  | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
7   | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
29  | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
33  | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
40  | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
43  | Next Level Seinfeld | Zvi | 1d | 6
70  | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
10  | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
199 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
108 | The next decades might be wild | Marius Hobbhahn | 5d | 21
70  | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
15  | Solution to The Alignment Problem | Algon | 1d | 0
95  | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
36  | AI Neorealism: a threat model & success criterion for existential safety | davidad | 5d | 0
59  | AI Safety Seems Hard to Measure | HoldenKarnofsky | 12d | 5
58  | Who are some prominent reasonable people who are confident that AI won't kill everyone? | Optimization Process | 15d | 40
93  | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86
87  | Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) | Jacy Reese Anthis | 28d | 64
15  | all claw, no world — and other thoughts on the universal distribution | carado | 6d | 0
217 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122
63  | Could a single alien message destroy us? | Writer | 25d | 23
515 | Where I agree and disagree with Eliezer | paulfchristiano | 6mo | 205
41  | Refining the Sharp Left Turn threat model, part 2: applying alignment techniques | Vika | 25d | 4
405 | AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 6mo | 653
79  | Am I secretly excited for AI getting weird? | porby | 1mo | 4
109 | Niceness is unnatural | So8res | 2mo | 18
72  | Far-UVC Light Update: No, LEDs are not around the corner (tweetstorm) | Davidmanheim | 1mo | 27