Tree of Tags

95 posts: Interpretability (ML & AI), AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Market making (AI safety technique), Verification

28 posts: Myopia, Self Fulfilling/Refuting Prophecies, Luck

16 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

140 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

56 Can we efficiently explain model behaviors?

paulfchristiano

4d

0

250 The Plan - 2022 Update

johnswentworth

19d

33

235 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

151 Re-Examining LayerNorm

Eric Winsor

19d

8

35 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

21 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

26 [ASoT] Natural abstractions and AlphaZero

Ulisse Mini

10d

1

56 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

446 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

78 By Default, GPTs Think In Plain Sight

Fabien Roger

1mo

16

80 Engineering Monosemanticity in Toy Models

Adam Jermyn

1mo

6

201 Interpreting Neural Networks through the Polytope Lens

Sid Black

2mo

26

34 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

48 Acceptability Verification: A Research Agenda

David Udell

5mo

0

15 Generative, Episodic Objectives for Safe AI

Michael Glass

2mo

3

53 LCDT, A Myopic Decision Theory

adamShimi

1y

51

60 Open Problems with Myopia

Mark Xu

1y

16

79 The Credit Assignment Problem

abramdemski

3y

40

56 AI safety via market making

evhub

2y

45

47 Why GPT wants to mesa-optimize & how we might change this

John_Maxwell

2y

32

47 Arguments against myopic training

Richard_Ngo

2y

39

13 Encouragement to Instill Confidence?

Zvi

10mo

5

56 Partial Agency

abramdemski

3y

18

21 Evan Hubinger on Homogeneity in Takeoff Speeds, Learned Optimization and Interpretability

Michaël Trazzi

1y

0

46 Towards a mechanistic understanding of corrigibility

evhub

3y

26

32 Random Thoughts on Predict-O-Matic

abramdemski

3y

3