95 posts
Interpretability (ML & AI)
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Market making (AI safety technique)
Verification
28 posts
Myopia
Self Fulfilling/Refuting Prophecies
Luck
13
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
123
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
26
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
211
The Plan - 2022 Update
johnswentworth
19d
33
13
The limited upside of interpretability
Peter S. Park
1mo
11
159
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
57
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
60
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
47
"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability
David Scott Krueger (formerly: capybaralet)
1mo
25
99
Re-Examining LayerNorm
Eric Winsor
19d
8
22
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
29
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
Neel Nanda
1mo
15
68
Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?
Neel Nanda
1mo
14
46
A Mystery About High Dimensional Concept Encoding
Fabien Roger
1mo
13
37
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
11
Generative, Episodic Objectives for Safe AI
Michael Glass
2mo
3
50
LCDT, A Myopic Decision Theory
adamShimi
1y
51
31
Random Thoughts on Predict-O-Matic
abramdemski
3y
3
33
An example of self-fulfilling spurious proofs in UDT
cousin_it
10y
43
10
Luck II: Expecting White Swans
fowlertm
9y
87
58
Partial Agency
abramdemski
3y
18
91
The Credit Assignment Problem
abramdemski
3y
40
55
Why GPT wants to mesa-optimize & how we might change this
John_Maxwell
2y
32
44
Towards a mechanistic understanding of corrigibility
evhub
3y
26
29
Self-Supervised Learning and AGI Safety
Steven Byrnes
3y
9
43
Acceptability Verification: A Research Agenda
David Udell
5mo
0
56
Arguments against myopic training
Richard_Ngo
2y
39
16
Encouragement to Instill Confidence?
Zvi
10mo
5