Tags — 1913 posts: AI, World Modeling, Inner Alignment, Rationality, Interpretability (ML & AI), AI Timelines, Decision Theory, GPT, Research Agendas, Abstraction, Value Learning, Impact Regularization
Tags — 855 posts: Logical Induction, Threat Models, Goodhart's Law, Practice & Philosophy of Science, Logical Uncertainty, Intellectual Progress (Society-Level), Radical Probabilism, Epistemology, Ethics & Morality, Software Tools, Fiction, Bayes' Theorem
Karma | Title | Author | Posted | Comments
15 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
62 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9
26 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
35 | Reframing inner alignment | davidad | 9d | 13
38 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9
79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
271 | Reward is not the optimization target | TurnTrout | 4mo | 97
39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
102 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
132 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
20 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
33 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
189 | The next decades might be wild | Marius Hobbhahn | 5d | 21
56 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15
141 | Worlds Where Iterative Design Fails | johnswentworth | 3mo | 26
108 | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86
82 | Thoughts on AGI organizations and capabilities work | Rob Bensinger | 13d | 17
429 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122
44 | AI X-risk >35% mostly based on a recent peer-reviewed argument | michaelcohen | 1mo | 31
69 | Deconfusing Direct vs Amortised Optimization | beren | 18d | 6
9 | Corrigibility Via Thought-Process Deference | Thane Ruthenis | 26d | 5
67 | We may be able to see sharp left turns coming | Ethan Perez | 3mo | 26
55 | Methodological Therapy: An Agenda For Tackling Research Bottlenecks | adamShimi | 2mo | 6
80 | Don't leave your fingerprints on the future | So8res | 2mo | 32
91 | Oversight Misses 100% of Thoughts The AI Does Not Think | johnswentworth | 4mo | 49
144 | Your posts should be on arXiv | JanBrauner | 3mo | 39