Tags: AI (1446 posts), Interpretability (ML & AI), AI Timelines, GPT, Research Agendas, Value Learning, AI Takeoff, Conjecture (org), Embedded Agency, Machine Learning (ML), Eliciting Latent Knowledge (ELK), Community (118 posts), Inner Alignment, Optimization, Solomonoff Induction, Predictive Processing, Selection vs Control, Neocortex, Mesa-Optimization, Neuroscience, Priors, AI Services (CAIS), Occam's Razor, General Intelligence
Karma | Title | Author | Posted | Comments
11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
25 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
59 | Reframing inner alignment | davidad | 9d | 13
45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
233 | Reward is not the optimization target | TurnTrout | 4mo | 97
35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
99 | Trying to disambiguate different questions about whether RLHF is "good" | Buck | 6d | 39
114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
32 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
35 | My AGI safety research—2022 review, '23 plans | Steven Byrnes | 6d | 6
183 | The Plan - 2022 Update | johnswentworth | 19d | 33
136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
223 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
90 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
42 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
20 | Value Formation: An Overarching Model | Thane Ruthenis | 1mo | 6
64 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
46 | Applications for Deconfusing Goal-Directedness | adamShimi | 1y | 3
75 | My take on Jacob Cannell's take on AGI safety | Steven Byrnes | 22d | 13
20 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
42 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
45 | Threat Model Literature Review | zac_kenton | 1mo | 4
79 | Externalized reasoning oversight: a research direction for language model alignment | tamera | 4mo | 22
33 | Framing AI Childhoods | David Udell | 3mo | 8
95 | What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? | johnswentworth | 4mo | 15
59 | Humans aren't fitness maximizers | So8res | 2mo | 45
71 | Human Mimicry Mainly Works When We're Already Close | johnswentworth | 4mo | 16