AI (1446 posts): Interpretability (ML & AI), AI Timelines, GPT, Research Agendas, Value Learning, AI Takeoff, Conjecture (org), Embedded Agency, Machine Learning (ML), Eliciting Latent Knowledge (ELK)
Community (118 posts): Inner Alignment, Optimization, Solomonoff Induction, Predictive Processing, Selection vs Control, Neocortex, Mesa-Optimization, Neuroscience, Priors, AI Services (CAIS), Occam's Razor, General Intelligence
Karma | Title | Author | Posted | Comments
28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
27 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
55 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
64 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
90 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
75 | My take on Jacob Cannell’s take on AGI safety | Steven Byrnes | 22d | 13
42 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
29 | Take 8: Queer the inner/outer alignment dichotomy. | Charlie Steiner | 11d | 2
42 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
20 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
45 | Threat Model Literature Review | zac_kenton | 1mo | 4
59 | Humans aren't fitness maximizers | So8res | 2mo | 45
95 | What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? | johnswentworth | 4mo | 15
20 | Value Formation: An Overarching Model | Thane Ruthenis | 1mo | 6
71 | Human Mimicry Mainly Works When We’re Already Close | johnswentworth | 4mo | 16
79 | Externalized reasoning oversight: a research direction for language model alignment | tamera | 4mo | 22
23 | Greed Is the Root of This Evil | Thane Ruthenis | 2mo | 4