3846 posts: AI, AI Risk, GPT, AI Timelines, Anthropics, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Existential Risk, Language Models, Conjecture (org), Whole Brain Emulation
302 posts: Goodhart's Law, Neuroscience, Optimization, Predictive Processing, General Intelligence, Inner Alignment, Adaptation Executors, Superstimuli, Neuralink, Selection vs Control, Brain-Computer Interfaces, Neocortex
Karma | Post | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
6 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
45 | Next Level Seinfeld | Zvi | 1d | 6
91 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
153 | The next decades might be wild | Marius Hobbhahn | 5d | 21
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
55 | Proper scoring rules don't guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
60 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
30 | Predictive Processing, Heterosexuality and Delusions of Grandeur | lsusr | 3d | 2
96 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
61 | My take on Jacob Cannell's take on AGI safety | Steven Byrnes | 22d | 13
35 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
26 | Take 8: Queer the inner/outer alignment dichotomy. | Charlie Steiner | 11d | 2
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
87 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13
60 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
37 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
77 | "Normal" is the equilibrium state of past optimization processes | Alex_Altair | 1mo | 5
14 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
34 | [Hebbian Natural Abstractions] Introduction | Samuel Nellessen | 29d | 3
29 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8