Tags (3846 posts): AI, AI Risk, GPT, AI Timelines, Anthropics, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Existential Risk, Language Models, Conjecture (org), Whole Brain Emulation
Tags (302 posts): Goodhart's Law, Neuroscience, Optimization, Predictive Processing, General Intelligence, Inner Alignment, Adaptation Executors, Superstimuli, Neuralink, Selection vs Control, Brain-Computer Interfaces, Neocortex
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 84 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 41 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 5 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0 |
| 112 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17 |
| 16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 47 | Next Level Seinfeld | Zvi | 1d | 6 |
| 198 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 265 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 6 | I believe some AI doomers are overconfident | FTPickle | 6h | 4 |
| 5 | Career Scouting: Housing Coordination | koratkar | 5h | 0 |
| 13 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 6 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0 |
| 59 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10 |
| 28 | Predictive Processing, Heterosexuality and Delusions of Grandeur | lsusr | 3d | 2 |
| 108 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18 |
| 55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27 |
| 24 | Take 8: Queer the inner/outer alignment dichotomy. | Charlie Steiner | 11d | 2 |
| 50 | My take on Jacob Cannell’s take on AGI safety | Steven Byrnes | 22d | 13 |
| 29 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4 |
| 72 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61 |
| 93 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13 |
| 33 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46 |
| 75 | "Normal" is the equilibrium state of past optimization processes | Alex_Altair | 1mo | 5 |
| 39 | [Hebbian Natural Abstractions] Introduction | Samuel Nellessen | 29d | 3 |
| 33 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8 |
| 28 | The Disastrously Confident And Inaccurate AI | Sharat Jacob Jacob | 1mo | 0 |