3846 posts: AI, AI Risk, GPT, AI Timelines, Anthropics, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Existential Risk, Language Models, Conjecture (org), Whole Brain Emulation
302 posts: Goodhart's Law, Neuroscience, Optimization, Predictive Processing, General Intelligence, Inner Alignment, Adaptation Executors, Superstimuli, Neuralink, Selection vs Control, Brain-Computer Interfaces, Neocortex
Karma | Post | Author | Posted | Comments
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
6 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
45 | Next Level Seinfeld | Zvi | 1d | 6
91 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
153 | The next decades might be wild | Marius Hobbhahn | 5d | 21
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
55 | Proper scoring rules don't guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
60 | Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) | LawrenceC | 4d | 10
30 | Predictive Processing, Heterosexuality and Delusions of Grandeur | lsusr | 3d | 2
96 | Inner and outer alignment decompose one hard problem into two extremely hard problems | TurnTrout | 18d | 18
61 | My take on Jacob Cannell's take on AGI safety | Steven Byrnes | 22d | 13
35 | Mesa-Optimizers via Grokking | orthonormal | 14d | 4
26 | Take 8: Queer the inner/outer alignment dichotomy. | Charlie Steiner | 11d | 2
55 | Alignment allows "nonrobust" decision-influences and doesn't require robust grading | TurnTrout | 21d | 27
87 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13
60 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
37 | Don't align agents to evaluations of plans | TurnTrout | 24d | 46
77 | "Normal" is the equilibrium state of past optimization processes | Alex_Altair | 1mo | 5
14 | Take 6: CAIS is actually Orwellian. | Charlie Steiner | 13d | 5
34 | [Hebbian Natural Abstractions] Introduction | Samuel Nellessen | 29d | 3
29 | Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight | Jacy Reese Anthis | 1mo | 8