AI (2237 posts)
Related tags: AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

Iterated Amplification (358 posts)
Related tags: Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 153 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 3 | I believe some AI doomers are overconfident | FTPickle | 6h | 4 |
| 37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 37 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 12 | Will Machines Ever Rule the World? MLAISU W50 | Esben Kran | 4d | 4 |
| 92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 221 | AGI Safety FAQ / all-dumb-questions-allowed thread | Aryeh Englander | 6mo | 514 |
| 8 | Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) | Remmelt | 1d | 6 |
| 159 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 27 | If Wentworth is right about natural abstractions, it would be bad for alignment | Wuschel Schulz | 12d | 5 |
| 92 | Revisiting algorithmic progress | Tamay | 7d | 6 |
| 59 | Predicting GPU performance | Marius Hobbhahn | 6d | 24 |
| 33 | Is the AI timeline too short to have children? | Yoreth | 6d | 20 |
| 13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 36 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14 |
| 123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 211 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 13 | The limited upside of interpretability | Peter S. Park | 1mo | 11 |
| 58 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15 |
| 159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |
| 37 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16 |
| 57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 36 | Empowerment is (almost) All We Need | jacob_cannell | 1mo | 43 |
| 151 | Godzilla Strategies | johnswentworth | 6mo | 65 |
| 32 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25 |
| 36 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3 |