Tree of Tags

95 posts: Interpretability (ML & AI), AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Market making (AI safety technique), Verification

28 posts: Myopia, Self Fulfilling/Refuting Prophecies, Luck

13 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

123 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

63 Can we efficiently explain model behaviors?

paulfchristiano

4d

0

211 The Plan - 2022 Update

johnswentworth

19d

33

26 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

159 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

99 Re-Examining LayerNorm

Eric Winsor

19d

8

22 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

31 [ASoT] Natural abstractions and AlphaZero

Ulisse Mini

10d

1

57 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

72 Engineering Monosemanticity in Toy Models

Adam Jermyn

1mo

6

338 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

60 By Default, GPTs Think In Plain Sight

Fabien Roger

1mo

16

24 Subsets and quotients in interpretability

Erik Jenner

18d

1

37 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

43 Acceptability Verification: A Research Agenda

David Udell

5mo

0

11 Generative, Episodic Objectives for Safe AI

Michael Glass

2mo

3

50 LCDT, A Myopic Decision Theory

adamShimi

1y

51

57 Open Problems with Myopia

Mark Xu

1y

16

91 The Credit Assignment Problem

abramdemski

3y

40

55 Why GPT wants to mesa-optimize & how we might change this

John_Maxwell

2y

32

56 Arguments against myopic training

Richard_Ngo

2y

39

16 Encouragement to Instill Confidence?

Zvi

10mo

5

55 AI safety via market making

evhub

2y

45

28 Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability

Michaël Trazzi

1y

0

58 Partial Agency

abramdemski

3y

18

38 Bayesian Evolving-to-Extinction

abramdemski

2y

13

44 Towards a mechanistic understanding of corrigibility

evhub

3y

26