95 posts
Interpretability (ML & AI)
AI Success Models
Lottery Ticket Hypothesis
Conservatism (AI)
Market making (AI safety technique)
Verification
28 posts
Myopia
Self Fulfilling/Refuting Prophecies
Luck
10
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
106
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
31
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
172
The Plan - 2022 Update
johnswentworth
19d
33
0
The limited upside of interpretability
Peter S. Park
1mo
11
83
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
beren
22d
27
58
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
42
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
49
"Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability
David Scott Krueger (formerly: capybaralet)
1mo
25
47
Re-Examining LayerNorm
Eric Winsor
19d
8
9
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
27
A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
Neel Nanda
1mo
15
51
Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?
Neel Nanda
1mo
14
36
A Mystery About High Dimensional Concept Encoding
Fabien Roger
1mo
13
40
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Evan R. Murphy
15d
16
7
Generative, Episodic Objectives for Safe AI
Michael Glass
2mo
3
47
LCDT, A Myopic Decision Theory
adamShimi
1y
51
30
Random Thoughts on Predict-O-Matic
abramdemski
3y
3
45
An example of self-fulfilling spurious proofs in UDT
cousin_it
10y
43
14
Luck II: Expecting White Swans
fowlertm
9y
87
60
Partial Agency
abramdemski
3y
18
103
The Credit Assignment Problem
abramdemski
3y
40
63
Why GPT wants to mesa-optimize & how we might change this
John_Maxwell
2y
32
42
Towards a mechanistic understanding of corrigibility
evhub
3y
26
35
Self-Supervised Learning and AGI Safety
Steven Byrnes
3y
9
38
Acceptability Verification: A Research Agenda
David Udell
5mo
0
65
Arguments against myopic training
Richard_Ngo
2y
39
19
Encouragement to Instill Confidence?
Zvi
10mo
5