Tree of Tags

734 posts: AI, Eliciting Latent Knowledge (ELK), Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews, Audio, AXRP, Redwood Research, Transcripts, Formal Proof, Domain Theory

74 posts: Embedded Agency, Reinforcement Learning, Subagents, Reward Functions, EfficientZero, Robust Agents, Wireheading, AI Capabilities, Spurious Counterfactuals, Category Theory, Memetics, Tradeoffs

25 Existential AI Safety is NOT separate from near-term applications

scasper

7d

15

45 Towards Hodge-podge Alignment

Cleo Nardo

1d

20

35 The "Minimal Latents" Approach to Natural Abstractions

johnswentworth

22h

6

99 Trying to disambiguate different questions about whether RLHF is “good”

Buck

6d

39

136 Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong

14d

77

82 A shot at the diamond-alignment problem

TurnTrout

2mo

53

113 Mechanistic anomaly detection and ELK

paulfchristiano

25d

17

106 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC

17d

9

213 AI alignment is distinct from its near-term applications

paulfchristiano

7d

5

39 In defense of probably wrong mechanistic models

evhub

14d

10

64 Verification Is Not Easier Than Generation In General

johnswentworth

14d

23

106 Finding gliders in the game of life

paulfchristiano

19d

7

32 Concept extrapolation for hypothesis generation

Stuart_Armstrong

8d

2

65 Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus

1mo

8

233 Reward is not the optimization target

TurnTrout

4mo

97

42 Four usages of "loss" in AI

TurnTrout

2mo

18

81 Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes

3mo

7

80 Seriously, what goes wrong with "reward the agent when it makes you smile"?

TurnTrout

4mo

41

83 Scaling Laws for Reward Model Overoptimization

leogao

2mo

11

37 Gradations of Agency

Daniel Kokotajlo

7mo

6

38 It matters when the first sharp left turn happens

Adam Jermyn

2mo

9

77 Towards deconfusing wireheading and reward maximization

leogao

3mo

7

38 Conditioning, Prompts, and Fine-Tuning

Adam Jermyn

4mo

9

134 Why Subagents?

johnswentworth

3y

42

259 Humans are very reliable agents

alyssavance

6mo

35

29 The reward engineering problem

paulfchristiano

3y

3

37 Committing, Assuming, Externalizing, and Internalizing

Scott Garrabrant

2y

25

39 Eight Definitions of Observability

Scott Garrabrant

2y

26