Tags (3,846 posts): AI, AI Risk, GPT, AI Timelines, Anthropics, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Existential Risk, Language Models, Conjecture (org), Whole Brain Emulation

Tags (302 posts): Goodhart's Law, Neuroscience, Optimization, Predictive Processing, General Intelligence, Inner Alignment, Adaptation Executors, Superstimuli, Neuralink, Selection vs Control, Brain-Computer Interfaces, Neocortex
Karma | Title | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
7 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
29 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
33 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
40 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
43 | Next Level Seinfeld | Zvi | 1d | 6
70 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
10 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
199 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
108 | The next decades might be wild | Marius Hobbhahn | 5d | 21
70 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
15 | Solution to The Alignment Problem | Algon | 1d | 0
95 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
61 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
32 | Predictive Processing, Heterosexuality and Delusions of Grandeur | lsusr | 3d | 2
84 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
72 | My take on Jacob Cannell’s take on AGI safety | Steven Byrnes | 22d | 13
41 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
28 | Take 8: Queer the inner/outer alignment dichotomy. | Charlie Steiner | 11d | 2
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
81 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13
41 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
20 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
48 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
79 | "Normal" is the equilibrium state of past optimization processes | Alex_Altair | 1mo | 5
29 | [Hebbian Natural Abstractions] Introduction | Samuel Nellessen | 29d | 3
25 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8