AI (2237 posts)
Tags: AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research
Iterated Amplification (358 posts)
Tags: Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
Karma | Title | Author | Age | Comments
84 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
198 | The next decades might be wild | Marius Hobbhahn | 5d | 21
6 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
41 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
52 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
11 | Will Machines Ever Rule the World? MLAISU W50 | Esben Kran | 4d | 4
89 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
282 | AGI Safety FAQ / all-dumb-questions-allowed thread | Aryeh Englander | 6mo | 514
19 | Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) | Remmelt | 1d | 6
190 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
25 | If Wentworth is right about natural abstractions, it would be bad for alignment | Wuschel Schulz | 12d | 5
111 | Revisiting algorithmic progress | Tamay | 7d | 6
74 | Predicting GPU performance | Marius Hobbhahn | 6d | 24
35 | Is the AI timeline too short to have children? | Yoreth | 6d | 20
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
26 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
21 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
26 | The limited upside of interpretability | Peter S. Park | 1mo | 11
59 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
34 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
56 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
34 | Empowerment is (almost) All We Need | jacob_cannell | 1mo | 43
184 | Godzilla Strategies | johnswentworth | 6mo | 65
26 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
27 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3