Tags (103 posts): Interpretability (ML & AI); Machine Learning (ML); DeepMind; Truth, Semantics, & Meaning; AI Success Models; OpenAI; Lottery Ticket Hypothesis; Anthropic; Conservatism (AI); Honesty; Principal-Agent Problems; Map and Territory

Tags (50 posts): GPT; Bounties & Prizes (active); AI-assisted Alignment; Moore's Law; Compute; Nanotechnology; List of Links; AI Safety Public Materials; Computer Science; Tripwire; Quantum Mechanics
Karma · Title · Author · Posted · Comments

11 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
59 · Reframing inner alignment · davidad · 9d · 13
114 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
32 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
183 · The Plan - 2022 Update · johnswentworth · 19d · 33
223 · A challenge for AGI organizations, and a challenge for readers · Rob Bensinger · 19d · 30
61 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
199 · Common misconceptions about OpenAI · Jacob_Hilton · 3mo · 138
51 · "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability · David Scott Krueger (formerly: capybaralet) · 1mo · 25
11 · Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 2
28 · A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · Neel Nanda · 1mo · 15
64 · Clarifying AI X-risk · zac_kenton · 1mo · 23
55 · Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? · Neel Nanda · 1mo · 14
25 · Toy Models and Tegum Products · Adam Jermyn · 1mo · 7
90 · [Link] Why I'm optimistic about OpenAI's alignment approach · janleike · 15d · 13
18 · An exploration of GPT-2's embedding weights · Adam Scherlis · 7d · 2
12 · Research request (alignment strategy): Deep dive on "making AI solve alignment for us" · JanBrauner · 19d · 3
12 · [LINK] - ChatGPT discussion · JanBrauner · 19d · 7
8 · Distribution Shifts and The Importance of AI Safety · Leon Lang · 2mo · 2
5 · AI-assisted list of ten concrete alignment things to do right now · lcmgcd · 3mo · 5
68 · NeurIPS ML Safety Workshop 2022 · Dan H · 4mo · 2
22 · [$20K in Prizes] AI Safety Arguments Competition · Dan H · 7mo · 543
125 · Developmental Stages of GPTs · orthonormal · 2y · 74
0 · New(ish) AI control ideas · Stuart_Armstrong · 5y · 0
3 · Corrigibility thoughts I: caring about multiple things · Stuart_Armstrong · 5y · 0
96 · Collection of GPT-3 results · Kaj_Sotala · 2y · 24
145 · interpreting GPT: the logit lens · nostalgebraist · 2y · 32
164 · MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · Rob Bensinger · 1y · 13