1564 posts — AI, Inner Alignment, Interpretability (ML & AI), AI Timelines, GPT, Research Agendas, AI Takeoff, Value Learning, Machine Learning (ML), Conjecture (org), Mesa-Optimization, Outer Alignment
349 posts — Abstraction, Impact Regularization, Rationality, World Modeling, Decision Theory, Human Values, Goal-Directedness, Anthropics, Utility Functions, Finite Factored Sets, Shard Theory, Fixed Point Theorems
Karma | Title | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
10 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
60 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
70 | Shard Theory in Nine Theses: a Distillation and Critical Appraisal | LawrenceC | 1d | 9
148 | Finite Factored Sets in Pictures | Magdalena Wache | 9d | 29
42 | Positive values seem more robust and lasting than prohibitions | TurnTrout | 3d | 9
47 | Take 7: You should talk about "the human's utility function" less. | Charlie Steiner | 12d | 22
777 | Where I agree and disagree with Eliezer | paulfchristiano | 6mo | 205
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
43 | Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide | Andrew_Critch | 26d | 34
310 | Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover | Ajeya Cotra | 5mo | 89
202 | The shard theory of human values | Quintin Pope | 3mo | 57
84 | Contra shard theory, in the context of the diamond maximizer problem | So8res | 2mo | 16
56 | «Boundaries», Part 3a: Defining boundaries as directed Markov blankets | Andrew_Critch | 1mo | 13
51 | Humans do acausal coordination all the time | Adam Jermyn | 1mo | 36
175 | Humans provide an untapped wealth of evidence about alignment | TurnTrout | 5mo | 92
29 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8