Tree of Tags

86 posts Interpretability (ML & AI) Lottery Ticket Hypothesis

9 posts AI Success Models Conservatism (AI) Market making (AI safety technique) Verification

338 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

211 The Plan - 2022 Update

johnswentworth

19d

33

159 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

139 Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner

1y

16

136 MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Rob Bensinger

1y

13

136 A transparency and interpretability tech tree

evhub

6mo

10

135 Understanding “Deep Double Descent”

evhub

3y

51

123 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

123 Interpreting Neural Networks through the Polytope Lens

Sid Black

2mo

26

118 Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth

6mo

52

118 The case for becoming a black-box investigator of language models

Buck

7mo

19

106 A Longlist of Theories of Impact for Interpretability

Neel Nanda

9mo

29

99 Re-Examining LayerNorm

Eric Winsor

19d

8

94 Circumventing interpretability: How to defeat mind-readers

Lee Sharkey

5mo

8

112 Conversation with Eliezer: What do you want the system to do?

Akash

5mo

38

78 A positive case for how we might succeed at prosaic AI alignment

evhub

1y

47

73 Various Alignment Strategies (and how likely they are to work)

Logan Zoellner

7mo

34

60 Solving the whole AGI control problem, version 0.0001

Steven Byrnes

1y

7

45 Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

7mo

0

31 Pessimism About Unknown Unknowns Inspires Conservatism

michaelcohen

2y

2

20 If AGI were coming in a year, what should we do?

MichaelStJules

8mo

16

17 RFC: Philosophical Conservatism in AI Alignment Research

Gordon Seidoh Worley

4y

13

13 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11