246 posts
Interpretability (ML & AI)
Instrumental Convergence
Corrigibility
Myopia
Deconfusion
Self Fulfilling/Refuting Prophecies
Orthogonality Thesis
AI Success Models
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
112 posts
Iterated Amplification
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Experiments
Ought
AI-assisted Alignment
Memory and Mnemonics
Air Conditioning
Verification
Posts (karma · title · author · posted · comments):
16 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
140 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
21 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
250 · The Plan - 2022 Update · johnswentworth · 19d · 33
26 · The limited upside of interpretability · Peter S. Park · 1mo · 11
59 · You can still fetch the coffee today if you're dead tomorrow · davidad · 11d · 15
235 · The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 27
34 · Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 16
56 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
34 · Empowerment is (almost) All We Need · jacob_cannell · 1mo · 43
26 · People care about each other even though they have imperfect motivational pointers? · TurnTrout · 1mo · 25
27 · Applications for Deconfusing Goal-Directedness · adamShimi · 1y · 3
78 · By Default, GPTs Think In Plain Sight · Fabien Roger · 1mo · 16
0 · The Opportunity and Risks of Learning Human Values In-Context · Zachary Robertson · 10d · 4
26 · Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. · Charlie Steiner · 8d · 14
184 · Godzilla Strategies · johnswentworth · 6mo · 65
44 · Notes on OpenAI’s alignment plan · Alex Flint · 12d · 5
11 · Alignment with argument-networks and assessment-predictions · Tor Økland Barstad · 7d · 3
21 · Research request (alignment strategy): Deep dive on "making AI solve alignment for us" · JanBrauner · 19d · 3
62 · Rant on Problem Factorization for Alignment · johnswentworth · 4mo · 48
61 · Relaxed adversarial training for inner alignment · evhub · 3y · 28
17 · Provably Honest - A First Step · Srijanak De · 1mo · 2
118 · Debate update: Obfuscated arguments problem · Beth Barnes · 1y · 21
20 · AI Safety via Debate · ESRogs · 4y · 13
13 · Getting from an unaligned AGI to an aligned AGI? · Tor Økland Barstad · 6mo · 7
80 · Air Conditioner Test Results & Discussion · johnswentworth · 6mo · 38
12 · AI-assisted list of ten concrete alignment things to do right now · lcmgcd · 3mo · 5
139 · Paul's research agenda FAQ · zhukeepa · 4y · 73