AI (1014 posts): AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Community, Eliciting Latent Knowledge (ELK), Reinforcement Learning, Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews

Iterated Amplification (111 posts): Game Theory, Factored Cognition, Humans Consulting HCH, Research Agendas, Ought, Debate (AI safety technique), Risks of Astronomical Suffering (S-risks), Center on Long-Term Risk (CLR), Mechanism Design, Fairness, Group Rationality
Score | Title                                                                                            | Author           | Posted | Comments
------+--------------------------------------------------------------------------------------------------+------------------+--------+---------
13    | Note on algorithms with multiple trained components                                              | Steven Byrnes    | 7h     | 1
45    | Towards Hodge-podge Alignment                                                                    | Cleo Nardo       | 1d     | 20
30    | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems.                    | Charlie Steiner  | 19h    | 0
35    | The "Minimal Latents" Approach to Natural Abstractions                                           | johnswentworth   | 22h    | 6
213   | AI alignment is distinct from its near-term applications                                         | paulfchristiano  | 7d     | 5
73    | Can we efficiently explain model behaviors?                                                      | paulfchristiano  | 4d     | 0
99    | Trying to disambiguate different questions about whether RLHF is “good”                          | Buck             | 6d     | 39
23    | Event [Berkeley]: Alignment Collaborator Speed-Meeting                                           | AlexMennen       | 1d     | 2
55    | High-level hopes for AI alignment                                                                | HoldenKarnofsky  | 5d     | 3
136   | Using GPT-Eliezer against ChatGPT Jailbreaking                                                   | Stuart_Armstrong | 14d    | 77
17    | Looking for an alignment tutor                                                                   | JanBrauner       | 3d     | 2
106   | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC        | 17d    | 9
106   | Finding gliders in the game of life                                                              | paulfchristiano  | 19d    | 7
37    | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.                                    | Charlie Steiner  | 7d     | 3
54    | «Boundaries», Part 3b: Alignment problems in terms of boundaries                                 | Andrew_Critch    | 6d     | 2
35    | My AGI safety research—2022 review, ’23 plans                                                    | Steven Byrnes    | 6d     | 6
47    | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.                                       | Charlie Steiner  | 8d     | 14
52    | Notes on OpenAI’s alignment plan                                                                 | Alex Flint       | 12d    | 5
231   | On how various plans miss the hard bits of the alignment challenge                               | So8res           | 5mo    | 81
155   | Some conceptual alignment research projects                                                      | Richard_Ngo      | 3mo    | 14
204   | Unifying Bargaining Notions (1/2)                                                                | Diffractor       | 4mo    | 38
98    | Threat-Resistant Bargaining Megapost: Introducing the ROSE Value                                 | Diffractor       | 2mo    | 11
16    | Theories of impact for Science of Deep Learning                                                  | Marius Hobbhahn  | 19d    | 0
155   | «Boundaries», Part 1: a key missing concept from utility theory                                  | Andrew_Critch    | 4mo    | 26
120   | Unifying Bargaining Notions (2/2)                                                                | Diffractor       | 4mo    | 11
85    | Rant on Problem Factorization for Alignment                                                      | johnswentworth   | 4mo    | 48
122   | Supervise Process, not Outcomes                                                                  | stuhlmueller     | 8mo    | 8
31    | A Library and Tutorial for Factored Cognition with Language Models                               | stuhlmueller     | 2mo    | 0