246 posts
Interpretability (ML & AI)
Instrumental Convergence
Corrigibility
Myopia
Deconfusion
Self Fulfilling/Refuting Prophecies
Orthogonality Thesis
AI Success Models
Treacherous Turn
Gradient Hacking
Mild Optimization
Quantilization
112 posts
Iterated Amplification
Debate (AI safety technique)
Factored Cognition
Humans Consulting HCH
Experiments
Ought
AI-assisted Alignment
Memory and Mnemonics
Air Conditioning
Verification
Karma | Title | Author | Posted | Comments
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
13 | The limited upside of interpretability | Peter S. Park | 1mo | 11
58 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
37 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
36 | Empowerment is (almost) All We Need | jacob_cannell | 1mo | 43
32 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
36 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3
60 | By Default, GPTs Think In Plain Sight | Fabien Roger | 1mo | 16
1 | The Opportunity and Risks of Learning Human Values In-Context | Zachary Robertson | 10d | 4
36 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
151 | Godzilla Strategies | johnswentworth | 6mo | 65
47 | Notes on OpenAI’s alignment plan | Alex Flint | 12d | 5
7 | Alignment with argument-networks and assessment-predictions | Tor Økland Barstad | 7d | 3
16 | Research request (alignment strategy): Deep dive on "making AI solve alignment for us" | JanBrauner | 19d | 3
73 | Rant on Problem Factorization for Alignment | johnswentworth | 4mo | 48
61 | Relaxed adversarial training for inner alignment | evhub | 3y | 28
10 | Provably Honest - A First Step | Srijanak De | 1mo | 2
125 | Debate update: Obfuscated arguments problem | Beth Barnes | 1y | 21
27 | AI Safety via Debate | ESRogs | 4y | 13
9 | Getting from an unaligned AGI to an aligned AGI? | Tor Økland Barstad | 6mo | 7
80 | Air Conditioner Test Results & Discussion | johnswentworth | 6mo | 38
8 | AI-assisted list of ten concrete alignment things to do right now | lcmgcd | 3mo | 5
125 | Paul's research agenda FAQ | zhukeepa | 4y | 73