Tree of Tags

95 posts: Interpretability (ML & AI), AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Market making (AI safety technique), Verification

28 posts: Myopia, Self Fulfilling/Refuting Prophecies, Luck

338 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

211 The Plan - 2022 Update

johnswentworth

19d

33

159 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

139 Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner

1y

16

136 MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Rob Bensinger

1y

13

136 A transparency and interpretability tech tree

evhub

6mo

10

135 Understanding “Deep Double Descent”

evhub

3y

51

123 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

123 Interpreting Neural Networks through the Polytope Lens

Sid Black

2mo

26

118 Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth

6mo

52

118 The case for becoming a black-box investigator of language models

Buck

7mo

19

112 Conversation with Eliezer: What do you want the system to do?

Akash

5mo

38

106 A Longlist of Theories of Impact for Interpretability

Neel Nanda

9mo

29

99 Re-Examining LayerNorm

Eric Winsor

19d

8

144 Self-fulfilling correlations

PhilGoetz

12y

50

91 The Credit Assignment Problem

abramdemski

3y

40

58 Partial Agency

abramdemski

3y

18

57 Open Problems with Myopia

Mark Xu

1y

16

56 Arguments against myopic training

Richard_Ngo

2y

39

55 Why GPT wants to mesa-optimize & how we might change this

John_Maxwell

2y

32

55 AI safety via market making

evhub

2y

45

50 LCDT, A Myopic Decision Theory

adamShimi

1y

51

44 Towards a mechanistic understanding of corrigibility

evhub

3y

26

43 Acceptability Verification: A Research Agenda

David Udell

5mo

0

38 Fifty Shades of Self-Fulfilling Prophecy

PhilGoetz

8y

87

38 Bayesian Evolving-to-Extinction

abramdemski

2y

13

37 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

33 An example of self-fulfilling spurious proofs in UDT

cousin_it

10y

43