Interpretability (ML & AI) (86 posts)
Lottery Ticket Hypothesis (9 posts)
AI Success Models
Conservatism (AI)
Market making (AI safety technique)
Verification
Karma | Title | Author | Posted | Comments
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
56 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
35 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
21 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
26 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
56 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
446 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 4mo | 39
78 | By Default, GPTs Think In Plain Sight | Fabien Roger | 1mo | 16
80 | Engineering Monosemanticity in Toy Models | Adam Jermyn | 1mo | 6
201 | Interpreting Neural Networks through the Polytope Lens | Sid Black | 2mo | 26
30 | Subsets and quotients in interpretability | Erik Jenner | 18d | 1
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
113 | Conversation with Eliezer: What do you want the system to do? | Akash | 5mo | 38
81 | Various Alignment Strategies (and how likely they are to work) | Logan Zoellner | 7mo | 34
70 | Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios | Evan R. Murphy | 7mo | 0
75 | A positive case for how we might succeed at prosaic AI alignment | evhub | 1y | 47
26 | If AGI were coming in a year, what should we do? | MichaelStJules | 8mo | 16
57 | Solving the whole AGI control problem, version 0.0001 | Steven Byrnes | 1y | 7
29 | Pessimism About Unknown Unknowns Inspires Conservatism | michaelcohen | 2y | 2
17 | RFC: Philosophical Conservatism in AI Alignment Research | Gordon Seidoh Worley | 4y | 13