Tree of Tags

123 posts: Interpretability (ML & AI), Myopia, Self Fulfilling/Refuting Prophecies, AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Luck, Market making (AI safety technique), Verification

123 posts: Instrumental Convergence, Corrigibility, Deconfusion, Orthogonality Thesis, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization, Satisficer, Gradient Descent, Tripwire

16 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
140 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
56 · Can we efficiently explain model behaviors? · paulfchristiano · 4d · 0
250 · The Plan - 2022 Update · johnswentworth · 19d · 33
235 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
151 · Re-Examining LayerNorm · Eric Winsor · 19d · 8
35 · Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 2
21 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
26 · [ASoT] Natural abstractions and AlphaZero · Ulisse Mini · 10d · 1
56 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
446 · A Mechanistic Interpretability Analysis of Grokking · Neel Nanda · 4mo · 39
78 · By Default, GPTs Think In Plain Sight · Fabien Roger · 1mo · 16
34 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
80 · Engineering Monosemanticity in Toy Models · Adam Jermyn · 1mo · 6
6 · (Extremely) Naive Gradient Hacking Doesn't Work · ojorgensen · 9h · 0
59 · You can still fetch the coffee today if you're dead tomorrow · davidad · 11d · 15
85 · Instrumental convergence is what makes general intelligence possible · tailcalled · 1mo · 11
6 · Assessing the Capabilities of ChatGPT through Success Rates · Zachary Robertson · 7d · 0
26 · Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" · joraine · 26d · 11
39 · A caveat to the Orthogonality Thesis · Wuschel Schulz · 1mo · 10
4 · Contrary to List of Lethality's point 22, alignment's door number 2 · False Name, Esq. · 6d · 1
114 · Interpretability/Tool-ness/Alignment/Corrigibility are not Composable · johnswentworth · 4mo · 8
26 · People care about each other even though they have imperfect motivational pointers? · TurnTrout · 1mo · 25
34 · Empowerment is (almost) All We Need · jacob_cannell · 1mo · 43
32 · Instrumental convergence in single-agent systems · Edouard Harris · 2mo · 4
91 · Let's See You Write That Corrigibility Tag · Eliezer Yudkowsky · 6mo · 67
9 · Corrigibility Via Thought-Process Deference · Thane Ruthenis · 26d · 5
25 · CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander] · berglund · 2mo · 1