Tree of Tags

1564 posts: AI, Inner Alignment, Interpretability (ML & AI), AI Timelines, GPT, Research Agendas, AI Takeoff, Value Learning, Machine Learning (ML), Conjecture (org), Mesa-Optimization, Outer Alignment

349 posts: Abstraction, Impact Regularization, Rationality, World Modeling, Decision Theory, Human Values, Goal-Directedness, Anthropics, Utility Functions, Finite Factored Sets, Shard Theory, Fixed Point Theorems

11 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

28 Discovering Language Model Behaviors with Model-Written Evaluations

evhub

4h

3

25 Existential AI Safety is NOT separate from near-term applications

scasper

7d

15

59 Reframing inner alignment

davidad

9d

13

45 Towards Hodge-podge Alignment

Cleo Nardo

1d

20

233 Reward is not the optimization target

TurnTrout

4mo

97

35 The "Minimal Latents" Approach to Natural Abstractions

johnswentworth

22h

6

90 Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout

18d

18

99 Trying to disambiguate different questions about whether RLHF is “good”

Buck

6d

39

114 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

32 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

35 My AGI safety research—2022 review, ’23 plans

Steven Byrnes

6d

6

42 Don't align agents to evaluations of plans

TurnTrout

24d

46

183 The Plan - 2022 Update

johnswentworth

19d

33

78 Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceC

1d

9

46 Positive values seem more robust and lasting than prohibitions

TurnTrout

3d

9

125 Finite Factored Sets in Pictures

Magdalena Wache

9d

29

58 Alignment allows "nonrobust" decision-influences and doesn't require robust grading

TurnTrout

21d

27

48 Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew_Critch

26d

34

93 wrapper-minds are the enemy

nostalgebraist

6mo

36

26 Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight

Jacy Reese Anthis

1mo

8

155 The shard theory of human values

Quintin Pope

3mo

57

28 Traps of Formalization in Deconfusion

adamShimi

1y

7

159 Humans provide an untapped wealth of evidence about alignment

TurnTrout

5mo

92

75 «Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Andrew_Critch

1mo

13

92 Human values & biases are inaccessible to the genome

TurnTrout

5mo

51

78 Builder/Breaker for Deconfusion

abramdemski

2mo

9

239 Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra

5mo

89