Tags (2237 posts): AI, AI Timelines, AI Takeoff, Careers, Audio, Infra-Bayesianism, DeepMind, Interviews, SERI MATS, Dialogue (format), Agent Foundations, Redwood Research

Tags (358 posts): Iterated Amplification, Myopia, Factored Cognition, Humans Consulting HCH, Corrigibility, Interpretability (ML & AI), Debate (AI safety technique), Experiments, Self Fulfilling/Refuting Prophecies, Ought, Orthogonality Thesis, Instrumental Convergence
Karma | Title | Author | Posted | Comments
84 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
41 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
5 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
198 | The next decades might be wild | Marius Hobbhahn | 5d | 21
265 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
6 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
5 | Career Scouting: Housing Coordination | koratkar | 5h | 0
13 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
71 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
19 | Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) | Remmelt | 1d | 6
323 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
107 | Okay, I feel it now | g1 | 7d | 14
111 | Revisiting algorithmic progress | Tamay | 7d | 6
89 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
6 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0
56 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
250 | The Plan - 2022 Update | johnswentworth | 19d | 33
235 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
151 | Re-Examining LayerNorm | Eric Winsor | 19d | 8
35 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
21 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
59 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
44 | Notes on OpenAI’s alignment plan | Alex Flint | 12d | 5
26 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
26 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1
56 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24