Tag cluster (123 posts):
- Interpretability (ML & AI)
- Myopia
- Self Fulfilling/Refuting Prophecies
- AI Success Models
- Lottery Ticket Hypothesis
- Conservatism (AI)
- Luck
- Market making (AI safety technique)
- Verification
Tag cluster (123 posts):
- Instrumental Convergence
- Corrigibility
- Deconfusion
- Orthogonality Thesis
- Treacherous Turn
- Gradient Hacking
- Mild Optimization
- Quantilization
- Satisficer
- Gradient Descent
- Tripwire
Posts (karma · title · author · posted · comments):

- 13 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
- 123 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
- 26 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
- 211 · The Plan - 2022 Update · johnswentworth · 19d · 33
- 13 · The limited upside of interpretability · Peter S. Park · 1mo · 11
- 159 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
- 37 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
- 57 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
- 60 · By Default, GPTs Think In Plain Sight · Fabien Roger · 1mo · 16
- 47 · "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability · David Scott Krueger (formerly: capybaralet) · 1mo · 25
- 99 · Re-Examining LayerNorm · Eric Winsor · 19d · 8
- 22 · Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 2
- 29 · A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · Neel Nanda · 1mo · 15
- 68 · Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? · Neel Nanda · 1mo · 14
- 58 · You can still fetch the coffee today if you're dead tomorrow · davidad · 11d · 15
- 36 · Empowerment is (almost) All We Need · jacob_cannell · 1mo · 43
- 32 · People care about each other even though they have imperfect motivational pointers? · TurnTrout · 1mo · 25
- 36 · Applications for Deconfusing Goal-Directedness · adamShimi · 1y · 3
- 1 · The Opportunity and Risks of Learning Human Values In-Context · Zachary Robertson · 10d · 4
- 25 · Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" · joraine · 26d · 11
- 12 · Is the Orthogonality Thesis true for humans? · Noosphere89 · 1mo · 18
- 72 · Instrumental convergence is what makes general intelligence possible · tailcalled · 1mo · 11
- 36 · A caveat to the Orthogonality Thesis · Wuschel Schulz · 1mo · 10
- 13 · Corrigibility Via Thought-Process Deference · Thane Ruthenis · 26d · 5
- 5 · [ASoT] Instrumental convergence is useful · Ulisse Mini · 1mo · 9
- 0 · Contrary to List of Lethality's point 22, alignment's door number 2 · False Name, Esq. · 6d · 1
- 17 · Misalignment-by-default in multi-agent systems · Edouard Harris · 2mo · 8
- 153 · Seeking Power is Often Convergently Instrumental in MDPs · TurnTrout · 3y · 38