1125 posts: AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory
321 posts: Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
Karma | Title | Author | Posted | Comments
62 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
37 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
10 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
21 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
232 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
63 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
92 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
18 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
159 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
42 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
49 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2
130 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
34 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6
15 | Looking for an alignment tutor | JanBrauner | 3d | 2
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
29 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
80 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
59 | Predicting GPU performance | Marius Hobbhahn | 6d | 24
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
93 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
183 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9
47 | Reframing inner alignment | davidad | 9d | 13