1913 posts: AI, World Modeling, Inner Alignment, Rationality, Interpretability (ML & AI), AI Timelines, Decision Theory, GPT, Research Agendas, Abstraction, Value Learning, Impact Regularization

855 posts: Logical Induction, Threat Models, Goodhart's Law, Practice & Philosophy of Science, Logical Uncertainty, Intellectual Progress (Society-Level), Radical Probabilism, Epistemology, Ethics & Morality, Software Tools, Fiction, Bayes' Theorem
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 78 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9 |
| 13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 64 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10 |
| 46 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9 |
| 99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2 |
| 121 | The next decades might be wild | Marius Hobbhahn | 5d | 21 |
| 106 | Thoughts on AGI organizations and capabilities work | Rob Bensinger | 13d | 17 |
| 38 | AI Neorealism: a threat model & success criterion for existential safety | davidad | 5d | 0 |
| 116 | Logical induction for software engineers | Alex Flint | 17d | 2 |
| 60 | You can still fetch the coffee today if you're dead tomorrow | davidad | 11d | 15 |
| 98 | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86 |
| 243 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122 |
| 20 | Reflections on the PIBBSS Fellowship 2022 | Nora_Ammann | 9d | 0 |
| 149 | Lessons learned from talking to >100 academics about AI safety | Marius Hobbhahn | 2mo | 16 |
| 462 | AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 6mo | 653 |
| 42 | Refining the Sharp Left Turn threat model, part 2: applying alignment techniques | Vika | 25d | 4 |
| 27 | Deconfusing Direct vs Amortised Optimization | beren | 18d | 6 |
| 113 | Niceness is unnatural | So8res | 2mo | 18 |
| 164 | Most People Start With The Same Few Bad Ideas | johnswentworth | 3mo | 30 |