Tree of Tags

47 posts: Interpretability (ML & AI), Empiricism

5 posts: AI Success Models, Conservatism (AI), Principal-Agent Problems, Market making (AI safety technique)

254 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

183 The Plan - 2022 Update

johnswentworth

19d

33

167 Chris Olah’s views on AGI safety

evhub

3y

38

155 A transparency and interpretability tech tree

evhub

6mo

10

146 Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner

1y

16

134 Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth

6mo

52

114 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

113 Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth

4mo

8

102 Search versus design

Alex Flint

2y

41

88 A Longlist of Theories of Impact for Interpretability

Neel Nanda

9mo

29

73 Polysemanticity and Capacity in Neural Networks

Buck

2mo

9

69 Engineering Monosemanticity in Toy Models

Adam Jermyn

1mo

6

64 An Analytic Perspective on AI Alignment

DanielFilan

2y

45

63 How Do Selection Theorems Relate To Interpretability?

johnswentworth

6mo

14

85 A positive case for how we might succeed at prosaic AI alignment

evhub

1y

47

65 Solving the whole AGI control problem, version 0.0001

Steven Byrnes

1y

7

34 Pessimism About Unknown Unknowns Inspires Conservatism

michaelcohen

2y

2

24 Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

7mo

0

11 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11