86 posts
Interpretability (ML & AI)
Lottery Ticket Hypothesis
9 posts
AI Success Models
Conservatism (AI)
Market making (AI safety technique)
Verification
106
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
31
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
172
The Plan - 2022 Update
johnswentworth
19d
33
0
The limited upside of interpretability
Peter S. Park
1mo
11
83
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
58
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
42
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
49
"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability
David Scott Krueger (formerly: capybaralet)
1mo
25
47
Re-Examining LayerNorm
Eric Winsor
19d
8
9
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
27
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
Neel Nanda
1mo
15
51
Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?
Neel Nanda
1mo
14
36
A Mystery About High Dimensional Concept Encoding
Fabien Roger
1mo
13
45
Interpreting Neural Networks through the Polytope Lens
Sid Black
2mo
26
10
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
111
Conversation with Eliezer: What do you want the system to do?
Akash
5mo
38
81
A positive case for how we might succeed at prosaic AI alignment
evhub
1y
47
17
RFC: Philosophical Conservatism in AI Alignment Research
Gordon Seidoh Worley
4y
13
65
Various Alignment Strategies (and how likely they are to work)
Logan Zoellner
7mo
34
20
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
Evan R. Murphy
7mo
0
63
Solving the whole AGI control problem, version 0.0001
Steven Byrnes
1y
7
33
Pessimism About Unknown Unknowns Inspires Conservatism
michaelcohen
2y
2
14
If AGI were coming in a year, what should we do?
MichaelStJules
8mo
16