Tags: AI (2595 posts), AI Timelines, AI Takeoff, Interpretability (ML & AI), Careers, Instrumental Convergence, Iterated Amplification, Corrigibility, Audio, Debate (AI safety technique), Infra-Bayesianism, DeepMind (488 posts), GPT, Conjecture (org), Art, Music, Machine Learning (ML), Bounties & Prizes (active), OpenAI, QURI, Language Models, Project Announcement, DALL-E, Meta-Humor
Karma | Title | Author | Posted | Comments
----- | ----- | ------ | ------ | --------
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
6 | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
153 | The next decades might be wild | Marius Hobbhahn | 5d | 21
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
55 | Proper scoring rules don't guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
92 | Trying to disambiguate different questions about whether RLHF is "good" | Buck | 6d | 39
3 | I believe some AI doomers are overconfident | FTPickle | 6h | 4
4 | (Extremely) Naive Gradient Hacking Doesn't Work | ojorgensen | 9h | 0
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
45 | Next Level Seinfeld | Zvi | 1d | 6
91 | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
13 | Does ChatGPT's performance warrant working on a tutor for children? [It's time to take it to the lab.] | Bill Benzon | 1d | 2
237 | Jailbreaking ChatGPT on Release Day | Zvi | 18d | 74
80 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
45 | Discovering Latent Knowledge in Language Models Without Supervision | Xodarap | 6d | 1
5 | Will research in AI risk jinx it? Consequences of training AI on AI risk arguments | Yann Dubois | 1d | 6
183 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9
123 | Did ChatGPT just gaslight me? | ThomasW | 19d | 45
46 | A brainteaser for language models | Adam Scherlis | 8d | 3
47 | Reframing inner alignment | davidad | 9d | 13
213 | Mysteries of mode collapse | janus | 1mo | 35