| Karma | Title | Author | Posted | Comments |
|---:|---|---|---|---:|
| 27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 70 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9 |
| 62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 10 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 60 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10 |
| 148 | Finite Factored Sets in Pictures | Magdalena Wache | 9d | 29 |
| 42 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9 |
| 55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 155 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 39 | AI Neorealism: a threat model & success criterion for existential safety | davidad | 5d | 0 |
| 94 | Thoughts on AGI organizations and capabilities work | Rob Bensinger | 13d | 17 |
| 124 | Logical induction for software engineers | Alex Flint | 17d | 2 |
| 58 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15 |
| 336 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122 |
| 31 | Reflections on the PIBBSS Fellowship 2022 | Nora_Ammann | 9d | 0 |
| 103 | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86 |
| 48 | Deconfusing Direct vs Amortised Optimization | beren | 18d | 6 |
| 724 | AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 6mo | 653 |
| 207 | Lessons learned from talking to >100 academics about AI safety | Marius Hobbhahn | 2mo | 16 |
| 36 | Refining the Sharp Left Turn threat model, part 2: applying alignment techniques | Vika | 25d | 4 |
| 161 | Most People Start With The Same Few Bad Ideas | johnswentworth | 3mo | 30 |
| 98 | Niceness is unnatural | So8res | 2mo | 18 |