Tags: Interpretability (ML & AI) (86 posts) · Lottery Ticket Hypothesis (9 posts) · AI Success Models · Conservatism (AI) · Market making (AI safety technique) · Verification
Posts (karma · title · author · posted · comments):

106 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
70 · Can we efficiently explain model behaviors? · paulfchristiano · 4d · 0
172 · The Plan - 2022 Update · johnswentworth · 19d · 33
31 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
36 · [ASoT] Natural abstractions and AlphaZero · Ulisse Mini · 10d · 1
83 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
58 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
47 · Re-Examining LayerNorm · Eric Winsor · 19d · 8
64 · Engineering Monosemanticity in Toy Models · Adam Jermyn · 1mo · 6
9 · Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 2
230 · A Mechanistic Interpretability Analysis of Grokking · Neel Nanda · 4mo · 39
42 · By Default, GPTs Think In Plain Sight · Fabien Roger · 1mo · 16
18 · Subsets and quotients in interpretability · Erik Jenner · 18d · 1
51 · Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? · Neel Nanda · 1mo · 14
10 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
111 · Conversation with Eliezer: What do you want the system to do? · Akash · 5mo · 38
65 · Various Alignment Strategies (and how likely they are to work) · Logan Zoellner · 7mo · 34
81 · A positive case for how we might succeed at prosaic AI alignment · evhub · 1y · 47
20 · Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios · Evan R. Murphy · 7mo · 0
63 · Solving the whole AGI control problem, version 0.0001 · Steven Byrnes · 1y · 7
14 · If AGI were coming in a year, what should we do? · MichaelStJules · 8mo · 16
33 · Pessimism About Unknown Unknowns Inspires Conservatism · michaelcohen · 2y · 2
17 · RFC: Philosophical Conservatism in AI Alignment Research · Gordon Seidoh Worley · 4y · 13