Tags (3846 posts): AI, AI Risk, GPT, AI Timelines, Anthropics, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Existential Risk, Language Models, Conjecture (org), Whole Brain Emulation

Tags (302 posts): Goodhart's Law, Neuroscience, Optimization, Predictive Processing, General Intelligence, Inner Alignment, Adaptation Executors, Superstimuli, Neuralink, Selection vs Control, Brain-Computer Interfaces, Neocortex
Karma | Title | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
84 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
198 | The next decades might be wild | Marius Hobbhahn | 5d | 21
6 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
41 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
37 | Reframing inner alignment | davidad | 9d | 13
7 | Will research in AI risk jinx it? Consequences of training AI on AI risk arguments | Yann Dubois | 1d | 6
112 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
52 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
47 | Next Level Seinfeld | Zvi | 1d | 6
26 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
11 | Will Machines Ever Rule the World? MLAISU W50 | Esben Kran | 4d | 4
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
108 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
59 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
33 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
72 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
21 | Value Formation: An Overarching Model | Thane Ruthenis | 1mo | 6
33 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8
28 | Predictive Processing, Heterosexuality and Delusions of Grandeur | lsusr | 3d | 2
50 | My take on Jacob Cannell’s take on AGI safety | Steven Byrnes | 22d | 13
47 | Humans aren't fitness maximizers | So8res | 2mo | 45
8 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
5 | Don't you think RLHF solves outer alignment? | Raphaël S | 1mo | 19
29 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
93 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13