47 posts
Interpretability (ML & AI)
Empiricism
5 posts
AI Success Models
Conservatism (AI)
Principal-Agent Problems
Market making (AI safety technique)
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
99 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
22 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
31 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
72 | Engineering Monosemanticity in Toy Models | Adam Jermyn | 1mo | 6
338 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 4mo | 39
24 | Subsets and quotients in interpretability | Erik Jenner | 18d | 1
68 | Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? | Neel Nanda | 1mo | 14
62 | A Barebones Guide to Mechanistic Interpretability Prerequisites | Neel Nanda | 1mo | 8
66 | An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers | Neel Nanda | 2mo | 5
78 | Polysemanticity and Capacity in Neural Networks | Buck | 2mo | 9
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
45 | Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios | Evan R. Murphy | 7mo | 0
78 | A positive case for how we might succeed at prosaic AI alignment | evhub | 1y | 47
60 | Solving the whole AGI control problem, version 0.0001 | Steven Byrnes | 1y | 7
31 | Pessimism About Unknown Unknowns Inspires Conservatism | michaelcohen | 2y | 2