Tree of Tags

95 posts Interpretability (ML & AI) AI Success Models Lottery Ticket Hypothesis Conservatism (AI) Market making (AI safety technique) Verification

28 posts Myopia Self Fulfilling/Refuting Prophecies Luck

13 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

123 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

26 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

211 The Plan - 2022 Update

johnswentworth

19d

33

13 The limited upside of interpretability

Peter S. Park

1mo

11

159 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

57 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

60 By Default, GPTs Think In Plain Sight

Fabien Roger

1mo

16

47 "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet)

1mo

25

99 Re-Examining LayerNorm

Eric Winsor

19d

8

22 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

29 A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda

1mo

15

68 Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda

1mo

14

46 A Mystery About High Dimensional Concept Encoding

Fabien Roger

1mo

13

37 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

11 Generative, Episodic Objectives for Safe AI

Michael Glass

2mo

3

50 LCDT, A Myopic Decision Theory

adamShimi

1y

51

31 Random Thoughts on Predict-O-Matic

abramdemski

3y

3

33 An example of self-fulfilling spurious proofs in UDT

cousin_it

10y

43

10 Luck II: Expecting White Swans

fowlertm

9y

87

58 Partial Agency

abramdemski

3y

18

91 The Credit Assignment Problem

abramdemski

3y

40

55 Why GPT wants to mesa-optimize & how we might change this

John_Maxwell

2y

32

44 Towards a mechanistic understanding of corrigibility

evhub

3y

26

29 Self-Supervised Learning and AGI Safety

Steven Byrnes

3y

9

43 Acceptability Verification: A Research Agenda

David Udell

5mo

0

56 Arguments against myopic training

Richard_Ngo

2y

39

16 Encouragement to Instill Confidence?

Zvi

10mo

5