Tree of Tags

86 posts Interpretability (ML & AI) Lottery Ticket Hypothesis

9 posts AI Success Models Conservatism (AI) Market making (AI safety technique) Verification

106 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

31 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

172 The Plan - 2022 Update

johnswentworth

19d

33

0 The limited upside of interpretability

Peter S. Park

1mo

11

83 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

58 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

42 By Default, GPTs Think In Plain Sight

Fabien Roger

1mo

16

49 "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet)

1mo

25

47 Re-Examining LayerNorm

Eric Winsor

19d

8

9 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

27 A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda

1mo

15

51 Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda

1mo

14

36 A Mystery About High Dimensional Concept Encoding

Fabien Roger

1mo

13

45 Interpreting Neural Networks through the Polytope Lens

Sid Black

2mo

26

10 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

111 Conversation with Eliezer: What do you want the system to do?

Akash

5mo

38

81 A positive case for how we might succeed at prosaic AI alignment

evhub

1y

47

17 RFC: Philosophical Conservatism in AI Alignment Research

Gordon Seidoh Worley

4y

13

65 Various Alignment Strategies (and how likely they are to work)

Logan Zoellner

7mo

34

20 Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

7mo

0

63 Solving the whole AGI control problem, version 0.0001

Steven Byrnes

1y

7

33 Pessimism About Unknown Unknowns Inspires Conservatism

michaelcohen

2y

2

14 If AGI were coming in a year, what should we do?

MichaelStJules

8mo

16