3083 posts: AI, GPT, AI Timelines, Machine Learning (ML), AI Takeoff, Interpretability (ML & AI), Language Models, Conjecture (org), Careers, Instrumental Convergence, Iterated Amplification, Art
763 posts: Anthropics, Existential Risk, Whole Brain Emulation, Sleeping Beauty Paradox, Threat Models, Academic Papers, Space Exploration & Colonization, Great Filter, Paradoxes, Extraterrestrial Life, Pascal's Mugging, Longtermism
Karma | Title | Author | Posted | Comments
27  | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
7   | Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic | Akash | 2h | 0
29  | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
33  | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
40  | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
43  | Next Level Seinfeld | Zvi | 1d | 6
70  | Bad at Arithmetic, Promising at Math | cohenmacaulay | 2d | 17
10  | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
199 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
106 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
108 | The next decades might be wild | Marius Hobbhahn | 5d | 21
70  | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
15  | Solution to The Alignment Problem | Algon | 1d | 0
95  | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
36  | AI Neorealism: a threat model & success criterion for existential safety | davidad | 5d | 0
59  | AI Safety Seems Hard to Measure | HoldenKarnofsky | 12d | 5
58  | Who are some prominent reasonable people who are confident that AI won't kill everyone? | Optimization Process | 15d | 40
93  | AI will change the world, but won’t take it over by playing “3-dimensional chess”. | boazbarak | 28d | 86
87  | Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) | Jacy Reese Anthis | 28d | 64
15  | all claw, no world — and other thoughts on the universal distribution | carado | 6d | 0
217 | Counterarguments to the basic AI x-risk case | KatjaGrace | 2mo | 122
63  | Could a single alien message destroy us? | Writer | 25d | 23
515 | Where I agree and disagree with Eliezer | paulfchristiano | 6mo | 205
41  | Refining the Sharp Left Turn threat model, part 2: applying alignment techniques | Vika | 25d | 4
405 | AGI Ruin: A List of Lethalities | Eliezer Yudkowsky | 6mo | 653
79  | Am I secretly excited for AI getting weird? | porby | 1mo | 4
109 | Niceness is unnatural | So8res | 2mo | 18
72  | Far-UVC Light Update: No, LEDs are not around the corner (tweetstorm) | Davidmanheim | 1mo | 27