AI (2237 posts): AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

Iterated Amplification (358 posts): Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
Karma | Title | Author | Age | Comments
40 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
108 | The next decades might be wild | Marius Hobbhahn | 5d | 21
0 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
33 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
22 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
13 | Will Machines Ever Rule the World? MLAISU W50 | Esben Kran | 4d | 4
95 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
160 | AGI Safety FAQ / all-dumb-questions-allowed thread | Aryeh Englander | 6mo | 514
-3 | Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) | Remmelt | 1d | 6
128 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
29 | If Wentworth is right about natural abstractions, it would be bad for alignment | Wuschel Schulz | 12d | 5
73 | Revisiting algorithmic progress | Tamay | 7d | 6
44 | Predicting GPU performance | Marius Hobbhahn | 6d | 24
31 | Is the AI timeline too short to have children? | Yoreth | 6d | 20
10 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
46 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
31 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
172 | The Plan - 2022 Update | johnswentworth | 19d | 33
0 | The limited upside of interpretability | Peter S. Park | 1mo | 11
57 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
83 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
40 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
58 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
38 | Empowerment is (almost) All We Need | jacob_cannell | 1mo | 43
118 | Godzilla Strategies | johnswentworth | 6mo | 65
38 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
45 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3