Tags (3846 posts): AI, AI Risk, GPT, AI Timelines, Anthropics, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Existential Risk, Language Models, Conjecture (org), Whole Brain Emulation

Tags (302 posts): Goodhart's Law, Neuroscience, Optimization, Predictive Processing, General Intelligence, Inner Alignment, Adaptation Executors, Superstimuli, Neuralink, Selection vs Control, Brain-Computer Interfaces, Neocortex
Karma | Title | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
84 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
16 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
198 | The next decades might be wild | Marius Hobbhahn | 5d | 21
6 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
41 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
37 | Reframing inner alignment | davidad | 9d | 13
7 | Will research in AI risk jinx it? Consequences of training AI on AI risk arguments | Yann Dubois | 1d | 6
112 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
52 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
47 | Next Level Seinfeld | Zvi | 1d | 6
26 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
11 | Will Machines Ever Rule the World? MLAISU W50 | Esben Kran | 4d | 4
140 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
108 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
59 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
33 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
72 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
21 | Value Formation: An Overarching Model | Thane Ruthenis | 1mo | 6
33 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8
28 | Predictive Processing, Heterosexuality and Delusions of Grandeur | lsusr | 3d | 2
50 | My take on Jacob Cannell’s take on AGI safety | Steven Byrnes | 22d | 13
47 | Humans aren't fitness maximizers | So8res | 2mo | 45
8 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
5 | Don't you think RLHF solves outer alignment? | Raphaël S | 1mo | 19
29 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
93 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13