Tags (1125 posts): AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory

Tags (321 posts): Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 25 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 233 | Reward is not the optimization target | TurnTrout | 4mo | 97 |
| 35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 35 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6 |
| 136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 82 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53 |
| 53 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61 |
| 47 | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. | Charlie Steiner | 8d | 14 |
| 113 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17 |
| 42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18 |
| 54 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2 |
| 51 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20 |
| 11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 59 | Reframing inner alignment | davidad | 9d | 13 |
| 114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 32 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 183 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 223 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 64 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10 |
| 35 | Side-channels: input versus output | davidad | 8d | 9 |
| 61 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 90 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13 |
| 43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 199 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138 |
| 96 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |