Tree of Tags

246 posts: Interpretability (ML & AI), Instrumental Convergence, Corrigibility, Myopia, Deconfusion, Self Fulfilling/Refuting Prophecies, Orthogonality Thesis, AI Success Models, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization

112 posts: Iterated Amplification, Debate (AI safety technique), Factored Cognition, Humans Consulting HCH, Experiments, Ought, AI-assisted Alignment, Memory and Mnemonics, Air Conditioning, Verification

13 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

123 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

63 Can we efficiently explain model behaviors?

paulfchristiano

4d

0

4 (Extremely) Naive Gradient Hacking Doesn't Work

ojorgensen

9h

0

211 The Plan - 2022 Update

johnswentworth

19d

33

26 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

159 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

58 You can still fetch the coffee today if you're dead tomorrow

davidad

11d

15

99 Re-Examining LayerNorm

Eric Winsor

19d

8

22 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

31 [ASoT] Natural abstractions and AlphaZero

Ulisse Mini

10d

1

57 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

37 Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy

15d

16

72 Engineering Monosemanticity in Toy Models

Adam Jermyn

1mo

6

36 Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

Charlie Steiner

8d

14

47 Notes on OpenAI’s alignment plan

Alex Flint

12d

5

7 Alignment with argument-networks and assessment-predictions

Tor Økland Barstad

7d

3

16 Research request (alignment strategy): Deep dive on "making AI solve alignment for us"

JanBrauner

19d

3

92 Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie

3mo

4

151 Godzilla Strategies

johnswentworth

6mo

65

47 A Library and Tutorial for Factored Cognition with Language Models

stuhlmueller

2mo

0

73 Rant on Problem Factorization for Alignment

johnswentworth

4mo

48

80 Air Conditioner Test Results & Discussion

johnswentworth

6mo

38

118 Supervise Process, not Outcomes

stuhlmueller

8mo

8

109 Preregistration: Air Conditioner Test

johnswentworth

8mo

64

35 Ought will host a factored cognition “Lab Meeting”

jungofthewon

3mo

1

10 Provably Honest - A First Step

Srijanak De

1mo

2

20 Briefly thinking through some analogs of debate

Eli Tyre

3mo

3