Tree of Tags

95 posts: Interpretability (ML & AI), AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Market making (AI safety technique), Verification

28 posts: Myopia, Self Fulfilling/Refuting Prophecies, Luck

10 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
31 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
172 | The Plan - 2022 Update | johnswentworth | 19d | 33
0 | The limited upside of interpretability | Peter S. Park | 1mo | 11
83 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
58 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
42 | By Default, GPTs Think In Plain Sight | Fabien Roger | 1mo | 16
49 | "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability | David Scott Krueger (formerly: capybaralet) | 1mo | 25
47 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
9 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
27 | A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) | Neel Nanda | 1mo | 15
51 | Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? | Neel Nanda | 1mo | 14
36 | A Mystery About High Dimensional Concept Encoding | Fabien Roger | 1mo | 13
40 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
7 | Generative, Episodic Objectives for Safe AI | Michael Glass | 2mo | 3
47 | LCDT, A Myopic Decision Theory | adamShimi | 1y | 51
30 | Random Thoughts on Predict-O-Matic | abramdemski | 3y | 3
45 | An example of self-fulfilling spurious proofs in UDT | cousin_it | 10y | 43
14 | Luck II: Expecting White Swans | fowlertm | 9y | 87
60 | Partial Agency | abramdemski | 3y | 18
103 | The Credit Assignment Problem | abramdemski | 3y | 40
63 | Why GPT wants to mesa-optimize & how we might change this | John_Maxwell | 2y | 32
42 | Towards a mechanistic understanding of corrigibility | evhub | 3y | 26
35 | Self-Supervised Learning and AGI Safety | Steven Byrnes | 3y | 9
38 | Acceptability Verification: A Research Agenda | David Udell | 5mo | 0
65 | Arguments against myopic training | Richard_Ngo | 2y | 39
19 | Encouragement to Instill Confidence? | Zvi | 10mo | 5