Tree of Tags

123 posts: Interpretability (ML & AI), Myopia, Self Fulfilling/Refuting Prophecies, AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Luck, Market making (AI safety technique), Verification

123 posts: Instrumental Convergence, Corrigibility, Deconfusion, Orthogonality Thesis, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization, Satisficer, Gradient Descent, Tripwire

Karma | Title | Author | Age | Comments
446 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 4mo | 39
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
201 | Interpreting Neural Networks through the Polytope Lens | Sid Black | 2mo | 26
164 | Understanding “Deep Double Descent” | evhub | 3y | 51
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
140 | Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers | lifelonglearner | 1y | 16
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
134 | Circumventing interpretability: How to defeat mind-readers | Lee Sharkey | 5mo | 8
131 | A Longlist of Theories of Impact for Interpretability | Neel Nanda | 9mo | 29
128 | The case for becoming a black-box investigator of language models | Buck | 7mo | 19
123 | A transparency and interpretability tech tree | evhub | 6mo | 10
114 | MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" | Rob Bensinger | 1y | 13
114 | Self-fulfilling correlations | PhilGoetz | 12y | 50

Karma | Title | Author | Age | Comments
239 | Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More | Ben Pace | 3y | 60
208 | Sorting Pebbles Into Correct Heaps | Eliezer Yudkowsky | 14y | 109
194 | Seeking Power is Often Convergently Instrumental in MDPs | TurnTrout | 3y | 38
114 | Interpretability/Tool-ness/Alignment/Corrigibility are not Composable | johnswentworth | 4mo | 8
102 | Soares, Tallinn, and Yudkowsky discuss AGI cognition | So8res | 1y | 35
101 | Gradient hacking | evhub | 3y | 39
97 | A broad basin of attraction around human values? | Wei_Dai | 8mo | 16
91 | Let's See You Write That Corrigibility Tag | Eliezer Yudkowsky | 6mo | 67
85 | Instrumental convergence is what makes general intelligence possible | tailcalled | 1mo | 11
80 | A Gym Gridworld Environment for the Treacherous Turn | Michaël Trazzi | 4y | 9
78 | Coherence arguments imply a force for goal-directed behavior | KatjaGrace | 1y | 27
69 | Looking Deeper at Deconfusion | adamShimi | 1y | 13
62 | Steam | abramdemski | 6mo | 9
61 | Goal retention discussion with Eliezer | MaxTegmark | 8y | 26