AI (808 posts)
Related tags: Embedded Agency, Eliciting Latent Knowledge (ELK), Reinforcement Learning, Infra-Bayesianism, Counterfactuals, Logic & Mathematics, AI Capabilities, Interviews, Audio, Subagents, Wireheading
Value Learning (126 posts)
Related tags: Inverse Reinforcement Learning, Machine Intelligence Research Institute (MIRI), Agent Foundations, Meta-Philosophy, Metaethics, Community, Philosophy, The Pointers Problem, Moral Uncertainty, Cognitive Reduction, Center for Human-Compatible AI (CHAI)
Karma | Title | Author | Posted | Comments
13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
55 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
106 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
106 | Finding gliders in the game of life | paulfchristiano | 19d | 7
37 | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. | Charlie Steiner | 7d | 3
64 | Verification Is Not Easier Than Generation In General | johnswentworth | 14d | 23
32 | Concept extrapolation for hypothesis generation | Stuart_Armstrong | 8d | 2
23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
17 | Looking for an alignment tutor | JanBrauner | 3d | 2
51 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20
53 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
39 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
40 | Beyond Kolmogorov and Shannon | Alexander Gietelink Oldenziel | 1mo | 14
197 | Why Agent Foundations? An Overly Abstract Explanation | johnswentworth | 9mo | 54
25 | A newcomer’s guide to the technical AI safety field | zeshen | 1mo | 1
45 | Clarifying the Agent-Like Structure Problem | johnswentworth | 2mo | 14
57 | AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022 | Sam Bowman | 3mo | 2
70 | Encultured AI Pre-planning, Part 1: Enabling New Benchmarks | Andrew_Critch | 4mo | 2
45 | Prize and fast track to alignment research at ALTER | Vanessa Kosoy | 3mo | 4
59 | Seeking Interns/RAs for Mechanistic Interpretability Projects | Neel Nanda | 4mo | 0
12 | The Slippery Slope from DALLE-2 to Deepfake Anarchy | scasper | 1mo | 9