Tree of Tags

2595 posts: AI, AI Timelines, AI Takeoff, Interpretability (ML & AI), Careers, Instrumental Convergence, Iterated Amplification, Corrigibility, Audio, Debate (AI safety technique), Infra-Bayesianism, DeepMind

488 posts: GPT, Conjecture (org), Art, Music, Machine Learning (ML), Bounties & Prizes (active), OpenAI, QURI, Language Models, Project Announcement, DALL-E, Meta-Humor

7 Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic

Akash

2h

0

29 Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems.

Charlie Steiner

19h

0

33 The "Minimal Latents" Approach to Natural Abstractions

johnswentworth

22h

6

40 Towards Hodge-podge Alignment

Cleo Nardo

1d

20

10 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

199 AI alignment is distinct from its near-term applications

paulfchristiano

7d

5

106 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

108 The next decades might be wild

Marius Hobbhahn

5d

21

70 Can we efficiently explain model behaviors?

paulfchristiano

4d

0

15 Solution to The Alignment Problem

Algon

1d

0

95 Trying to disambiguate different questions about whether RLHF is “good”

Buck

6d

39

22 Event [Berkeley]: Alignment Collaborator Speed-Meeting

AlexMennen

1d

2

54 High-level hopes for AI alignment

HoldenKarnofsky

5d

3

39 Proper scoring rules don’t guarantee predicting fixed points

Johannes_Treutlein

4d

2

27 Discovering Language Model Behaviors with Model-Written Evaluations

evhub

4h

3

43 Next Level Seinfeld

Zvi

1d

6

70 Bad at Arithmetic, Promising at Math

cohenmacaulay

2d

17

26 Take 11: "Aligning language models" should be weirder.

Charlie Steiner

2d

0

11 Does ChatGPT’s performance warrant working on a tutor for children? [It’s time to take it to the lab.]

Bill Benzon

1d

2

59 [Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey

7d

10

160 Jailbreaking ChatGPT on Release Day

Zvi

18d

74

55 A brainteaser for language models

Adam Scherlis

8d

3

38 Discovering Latent Knowledge in Language Models Without Supervision

Xodarap

6d

1

57 Reframing inner alignment

davidad

9d

13

98 Did ChatGPT just gaslight me?

ThomasW

19d

45

132 Conjecture: a retrospective after 8 months of work

Connor Leahy

27d

9

3 Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois

1d

6

103 What I Learned Running Refine

adamShimi

26d

5