Tree of Tags

123 posts: Interpretability (ML & AI), Myopia, Self Fulfilling/Refuting Prophecies, AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Luck, Market making (AI safety technique), Verification

123 posts: Instrumental Convergence, Corrigibility, Deconfusion, Orthogonality Thesis, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization, Satisficer, Gradient Descent, Tripwire

13 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

123 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

63 Can we efficiently explain model behaviors?

paulfchristiano

4d

0

211 The Plan - 2022 Update

johnswentworth

19d

33

26 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

159 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

99 Re-Examining LayerNorm

Eric Winsor

19d

8

22 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

31 [ASoT] Natural abstractions and AlphaZero

Ulisse Mini

10d

1

57 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

37 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

72 Engineering Monosemanticity in Toy Models

Adam Jermyn

1mo

6

338 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

60 By Default, GPTs Think In Plain Sight

Fabien Roger

1mo

16

4 (Extremely) Naive Gradient Hacking Doesn't Work

ojorgensen

9h

0

58 You can still fetch the coffee today if you're dead tomorrow

davidad

11d

15

72 Instrumental convergence is what makes general intelligence possible

tailcalled

1mo

11

25 Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real"

joraine

26d

11

5 Assessing the Capabilities of ChatGPT through Success Rates

Zachary Robertson

7d

0

36 A caveat to the Orthogonality Thesis

Wuschel Schulz

1mo

10

32 People care about each other even though they have imperfect motivational pointers?

TurnTrout

1mo

25

111 Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth

4mo

8

36 Empowerment is (almost) All We Need

jacob_cannell

1mo

43

13 Corrigibility Via Thought-Process Deference

Thane Ruthenis

26d

5

109 Let's See You Write That Corrigibility Tag

Eliezer Yudkowsky

6mo

67

27 Instrumental convergence in single-agent systems

Edouard Harris

2mo

4

105 A broad basin of attraction around human values?

Wei_Dai

8mo

16

61 Steam

abramdemski

6mo

9