Tree of Tags

246 posts: Interpretability (ML & AI), Instrumental Convergence, Corrigibility, Myopia, Deconfusion, Self Fulfilling/Refuting Prophecies, Orthogonality Thesis, AI Success Models, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization

112 posts: Iterated Amplification, Debate (AI safety technique), Factored Cognition, Humans Consulting HCH, Experiments, Ought, AI-assisted Alignment, Memory and Mnemonics, Air Conditioning, Verification

446 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

250 The Plan - 2022 Update

johnswentworth

19d

33

239 Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben Pace

3y

60

235 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

208 Sorting Pebbles Into Correct Heaps

Eliezer Yudkowsky

14y

109

201 Interpreting Neural Networks through the Polytope Lens

Sid Black

2mo

26

194 Seeking Power is Often Convergently Instrumental in MDPs

TurnTrout

3y

38

164 Understanding “Deep Double Descent”

evhub

3y

51

151 Re-Examining LayerNorm

Eric Winsor

19d

8

140 Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner

1y

16

140 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

134 Circumventing interpretability: How to defeat mind-readers

Lee Sharkey

5mo

8

131 A Longlist of Theories of Impact for Interpretability

Neel Nanda

9mo

29

128 The case for becoming a black-box investigator of language models

Buck

7mo

19

184 Godzilla Strategies

johnswentworth

6mo

65

139 Paul's research agenda FAQ

zhukeepa

4y

73

121 My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

Chi Nguyen

2y

21

120 Supervise Process, not Outcomes

stuhlmueller

8mo

8

118 Debate update: Obfuscated arguments problem

Beth Barnes

1y

21

112 Preregistration: Air Conditioner Test

johnswentworth

8mo

64

105 Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie

3mo

4

98 Ought: why it matters and ways to help

paulfchristiano

3y

7

97 Writeup: Progress on AI Safety via Debate

Beth Barnes

2y

18

89 Solving Math Problems by Relay

bgold

2y

26

82 Imitative Generalisation (AKA 'Learning the Prior')

Beth Barnes

1y

14

80 Air Conditioner Test Results & Discussion

johnswentworth

6mo

38

74 A guide to Iterated Amplification & Debate

Rafael Harth

2y

10

69 Why I'm excited about Debate

Richard_Ngo

1y

12