Tree of Tags

1014 posts: AI, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Community, Eliciting Latent Knowledge (ELK), Reinforcement Learning, Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews

111 posts: Iterated Amplification, Game Theory, Factored Cognition, Humans Consulting HCH, Research Agendas, Ought, Debate (AI safety technique), Risks of Astronomical Suffering (S-risks), Center on Long-Term Risk (CLR), Mechanism Design, Fairness, Group Rationality

25 Existential AI Safety is NOT separate from near-term applications

scasper

7d

15

45 Towards Hodge-podge Alignment

Cleo Nardo

1d

20

233 Reward is not the optimization target

TurnTrout

4mo

97

35 The "Minimal Latents" Approach to Natural Abstractions

johnswentworth

22h

6

99 Trying to disambiguate different questions about whether RLHF is “good”

Buck

6d

39

136 Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong

14d

77

82 A shot at the diamond-alignment problem

TurnTrout

2mo

53

53 Don't design agents which exploit adversarial inputs

TurnTrout

1mo

61

113 Mechanistic anomaly detection and ELK

paulfchristiano

25d

17

42 Four usages of "loss" in AI

TurnTrout

2mo

18

51 Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

Akash

28d

20

106 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC

17d

9

213 AI alignment is distinct from its near-term applications

paulfchristiano

7d

5

39 In defense of probably wrong mechanistic models

evhub

14d

10

35 My AGI safety research—2022 review, ’23 plans

Steven Byrnes

6d

6

47 Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

Charlie Steiner

8d

14

54 «Boundaries», Part 3b: Alignment problems in terms of boundaries

Andrew_Critch

6d

2

52 Notes on OpenAI’s alignment plan

Alex Flint

12d

5

231 On how various plans miss the hard bits of the alignment challenge

So8res

5mo

81

85 Rant on Problem Factorization for Alignment

johnswentworth

4mo

48

64 Relaxed adversarial training for inner alignment

evhub

3y

28

155 «Boundaries», Part 1: a key missing concept from utility theory

Andrew_Critch

4mo

26

98 Threat-Resistant Bargaining Megapost: Introducing the ROSE Value

Diffractor

2mo

11

204 Unifying Bargaining Notions (1/2)

Diffractor

4mo

38

155 Some conceptual alignment research projects

Richard_Ngo

3mo

14

120 Unifying Bargaining Notions (2/2)

Diffractor

4mo

11

118 Paul's research agenda FAQ

zhukeepa

4y

73

35 A Small Negative Result on Debate

Sam Bowman

8mo

11