Tags (123 posts):
- Interpretability (ML & AI)
- Myopia
- Self-Fulfilling/Refuting Prophecies
- AI Success Models
- Lottery Ticket Hypothesis
- Conservatism (AI)
- Luck
- Market Making (AI safety technique)
- Verification
Tags (123 posts):
- Instrumental Convergence
- Corrigibility
- Deconfusion
- Orthogonality Thesis
- Treacherous Turn
- Gradient Hacking
- Mild Optimization
- Quantilization
- Satisficer
- Gradient Descent
- Tripwire
Posts:
- An Open Agency Architecture for Safe Transformative AI (davidad, 11h) · 10 points, 11 comments
- How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme (Collin, 5d) · 106 points, 18 comments
- Paper: Transformers learn in-context by gradient descent (LawrenceC, 4d) · 31 points, 11 comments
- The Plan - 2022 Update (johnswentworth, 19d) · 172 points, 33 comments
- The limited upside of interpretability (Peter S. Park, 1mo) · 0 points, 11 comments
- The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (beren, 22d) · 83 points, 27 comments
- Steering Behaviour: Testing for (Non-)Myopia in Language Models (Evan R. Murphy, 15d) · 40 points, 16 comments
- Multi-Component Learning and S-Curves (Adam Jermyn, 20d) · 58 points, 24 comments
- By Default, GPTs Think In Plain Sight (Fabien Roger, 1mo) · 42 points, 16 comments
- "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability (David Scott Krueger (formerly: capybaralet), 1mo) · 49 points, 25 comments
- Re-Examining LayerNorm (Eric Winsor, 19d) · 47 points, 8 comments
- Extracting and Evaluating Causal Direction in LLMs' Activations (Fabien Roger, 6d) · 9 points, 2 comments
- A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) (Neel Nanda, 1mo) · 27 points, 15 comments
- Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? (Neel Nanda, 1mo) · 51 points, 14 comments
- You can still fetch the coffee today if you're dead tomorrow (davidad, 11d) · 57 points, 15 comments
- Empowerment is (almost) All We Need (jacob_cannell, 1mo) · 38 points, 43 comments
- People care about each other even though they have imperfect motivational pointers? (TurnTrout, 1mo) · 38 points, 25 comments
- Applications for Deconfusing Goal-Directedness (adamShimi, 1y) · 45 points, 3 comments
- The Opportunity and Risks of Learning Human Values In-Context (Zachary Robertson, 10d) · 2 points, 4 comments
- Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" (joraine, 26d) · 24 points, 11 comments
- Is the Orthogonality Thesis true for humans? (Noosphere89, 1mo) · 10 points, 18 comments
- Instrumental convergence is what makes general intelligence possible (tailcalled, 1mo) · 59 points, 11 comments
- A caveat to the Orthogonality Thesis (Wuschel Schulz, 1mo) · 33 points, 10 comments
- Corrigibility Via Thought-Process Deference (Thane Ruthenis, 26d) · 17 points, 5 comments
- [ASoT] Instrumental convergence is useful (Ulisse Mini, 1mo) · 3 points, 9 comments
- Contrary to List of Lethality's point 22, alignment's door number 2 (False Name, Esq., 6d) · -4 points, 1 comment
- Misalignment-by-default in multi-agent systems (Edouard Harris, 2mo) · 17 points, 8 comments
- Seeking Power is Often Convergently Instrumental in MDPs (TurnTrout, 3y) · 112 points, 38 comments