1125 posts
Tags: AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory

321 posts
Tags: Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
Karma | Title | Author | Posted | Comments
37 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
252 | Reward is not the optimization target | TurnTrout | 4mo | 97
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
34 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
159 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
77 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53
60 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
36 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14
121 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17
42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18
49 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2
69 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
47 | Reframing inner alignment | davidad | 9d | 13
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
80 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
35 | Side-channels: input versus output | davidad | 8d | 9
57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
93 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
226 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27