Tags (123 posts):
- Interpretability (ML & AI)
- Myopia
- Self-Fulfilling/Refuting Prophecies
- AI Success Models
- Lottery Ticket Hypothesis
- Conservatism (AI)
- Luck
- Market Making (AI safety technique)
- Verification
Tags (123 posts):
- Instrumental Convergence
- Corrigibility
- Deconfusion
- Orthogonality Thesis
- Treacherous Turn
- Gradient Hacking
- Mild Optimization
- Quantilization
- Satisficer
- Gradient Descent
- Tripwire
Posts:
- An Open Agency Architecture for Safe Transformative AI (davidad, 11h) · 10 points, 11 comments
- How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme (Collin, 5d) · 106 points, 18 comments
- Paper: Transformers learn in-context by gradient descent (LawrenceC, 4d) · 31 points, 11 comments
- The Plan - 2022 Update (johnswentworth, 19d) · 172 points, 33 comments
- The limited upside of interpretability (Peter S. Park, 1mo) · 0 points, 11 comments
- The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (beren, 22d) · 83 points, 27 comments
- Steering Behaviour: Testing for (Non-)Myopia in Language Models (Evan R. Murphy, 15d) · 40 points, 16 comments
- Multi-Component Learning and S-Curves (Adam Jermyn, 20d) · 58 points, 24 comments
- By Default, GPTs Think In Plain Sight (Fabien Roger, 1mo) · 42 points, 16 comments
- "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability (David Scott Krueger (formerly: capybaralet), 1mo) · 49 points, 25 comments
- Re-Examining LayerNorm (Eric Winsor, 19d) · 47 points, 8 comments
- Extracting and Evaluating Causal Direction in LLMs' Activations (Fabien Roger, 6d) · 9 points, 2 comments
- A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) (Neel Nanda, 1mo) · 27 points, 15 comments
- Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? (Neel Nanda, 1mo) · 51 points, 14 comments
- You can still fetch the coffee today if you're dead tomorrow (davidad, 11d) · 57 points, 15 comments
- Empowerment is (almost) All We Need (jacob_cannell, 1mo) · 38 points, 43 comments
- People care about each other even though they have imperfect motivational pointers? (TurnTrout, 1mo) · 38 points, 25 comments
- Applications for Deconfusing Goal-Directedness (adamShimi, 1y) · 45 points, 3 comments
- The Opportunity and Risks of Learning Human Values In-Context (Zachary Robertson, 10d) · 2 points, 4 comments
- Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" (joraine, 26d) · 24 points, 11 comments
- Is the Orthogonality Thesis true for humans? (Noosphere89, 1mo) · 10 points, 18 comments
- Instrumental convergence is what makes general intelligence possible (tailcalled, 1mo) · 59 points, 11 comments
- A caveat to the Orthogonality Thesis (Wuschel Schulz, 1mo) · 33 points, 10 comments
- Corrigibility Via Thought-Process Deference (Thane Ruthenis, 26d) · 17 points, 5 comments
- [ASoT] Instrumental convergence is useful (Ulisse Mini, 1mo) · 3 points, 9 comments
- Contrary to List of Lethality's point 22, alignment's door number 2 (False Name, Esq., 6d) · -4 points, 1 comment
- Misalignment-by-default in multi-agent systems (Edouard Harris, 2mo) · 17 points, 8 comments
- Seeking Power is Often Convergently Instrumental in MDPs (TurnTrout, 3y) · 112 points, 38 comments