Tags (734 posts): AI, Eliciting Latent Knowledge (ELK), Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews, Audio, AXRP, Redwood Research, Transcripts, Formal Proof, Domain Theory
Tags (74 posts): Embedded Agency, Reinforcement Learning, Subagents, Reward Functions, EfficientZero, Robust Agents, Wireheading, AI Capabilities, Spurious Counterfactuals, Category Theory, Memetics, Tradeoffs
Karma | Title | Author | Posted | Comments
79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
12 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
53 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
29 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
129 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17
76 | Finding gliders in the game of life | paulfchristiano | 19d | 7
23 | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. | Charlie Steiner | 7d | 3
503 | (My understanding of) What Everyone in Technical Alignment is Doing and Why | Thomas Larsen | 3mo | 83
7 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
104 | Will we run out of ML data? Evidence from projecting dataset size trends | Pablo Villalobos | 1mo | 12
271 | Reward is not the optimization target | TurnTrout | 4mo | 97
43 | A Short Dialogue on the Meaning of Reward Functions | Leon Lang | 1mo | 0
89 | Scaling Laws for Reward Model Overoptimization | leogao | 2mo | 11
237 | Humans are very reliable agents | alyssavance | 6mo | 35
107 | Evaluations project @ ARC is hiring a researcher and a webdev/engineer | Beth Barnes | 3mo | 7
334 | EfficientZero: How It Works | 1a3orn | 1y | 42
61 | Towards deconfusing wireheading and reward maximization | leogao | 3mo | 7
42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18
72 | Seriously, what goes wrong with "reward the agent when it makes you smile"? | TurnTrout | 4mo | 41
120 | We have achieved Noob Gains in AI | phdead | 7mo | 21
32 | It matters when the first sharp left turn happens | Adam Jermyn | 2mo | 9
124 | EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised | gwern | 1y | 52