Tags (123 posts): Interpretability (ML & AI), Myopia, Self Fulfilling/Refuting Prophecies, AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Luck, Market making (AI safety technique), Verification
Tags (123 posts): Instrumental Convergence, Corrigibility, Deconfusion, Orthogonality Thesis, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization, Satisficer, Gradient Descent, Tripwire
Karma | Title | Author | Posted | Comments
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
21 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
26 | The limited upside of interpretability | Peter S. Park | 1mo | 11
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
34 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
56 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
78 | By Default, GPTs Think In Plain Sight | Fabien Roger | 1mo | 16
45 | "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability | David Scott Krueger (formerly: capybaralet) | 1mo | 25
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
35 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
31 | A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) | Neel Nanda | 1mo | 15
85 | Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? | Neel Nanda | 1mo | 14
59 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
34 | Empowerment is (almost) All We Need | jacob_cannell | 1mo | 43
26 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
27 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3
0 | The Opportunity and Risks of Learning Human Values In-Context | Zachary Robertson | 10d | 4
26 | Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" | joraine | 26d | 11
14 | Is the Orthogonality Thesis true for humans? | Noosphere89 | 1mo | 18
85 | Instrumental convergence is what makes general intelligence possible | tailcalled | 1mo | 11
39 | A caveat to the Orthogonality Thesis | Wuschel Schulz | 1mo | 10
9 | Corrigibility Via Thought-Process Deference | Thane Ruthenis | 26d | 5
7 | [ASoT] Instrumental convergence is useful | Ulisse Mini | 1mo | 9
4 | Contrary to List of Lethality's point 22, alignment's door number 2 | False Name, Esq. | 6d | 1
17 | Misalignment-by-default in multi-agent systems | Edouard Harris | 2mo | 8
194 | Seeking Power is Often Convergently Instrumental in MDPs | TurnTrout | 3y | 38