AI (2237 posts)
Related tags: AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

Iterated Amplification (358 posts)
Related tags: Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 153 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 3 | I believe some AI doomers are overconfident | FTPickle | 6h | 4 |
| 37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 37 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 12 | Will Machines Ever Rule the World? MLAISU W50 | Esben Kran | 4d | 4 |
| 92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 221 | AGI Safety FAQ / all-dumb-questions-allowed thread | Aryeh Englander | 6mo | 514 |
| 8 | Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) | Remmelt | 1d | 6 |
| 159 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 27 | If Wentworth is right about natural abstractions, it would be bad for alignment | Wuschel Schulz | 12d | 5 |
| 92 | Revisiting algorithmic progress | Tamay | 7d | 6 |
| 59 | Predicting GPU performance | Marius Hobbhahn | 6d | 24 |
| 33 | Is the AI timeline too short to have children? | Yoreth | 6d | 20 |
| 13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 36 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14 |
| 123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 211 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 13 | The limited upside of interpretability | Peter S. Park | 1mo | 11 |
| 58 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15 |
| 159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |
| 37 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16 |
| 57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 36 | Empowerment is (almost) All We Need | jacob_cannell | 1mo | 43 |
| 151 | Godzilla Strategies | johnswentworth | 6mo | 65 |
| 32 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25 |
| 36 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3 |