Tags (1125 posts): AI, Research Agendas, AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Eliciting Latent Knowledge (ELK), Community, Reinforcement Learning, Iterated Amplification, Debate (AI safety technique), Game Theory
Tags (321 posts): Conjecture (org), GPT, Oracle AI, Interpretability (ML & AI), Myopia, Language Models, OpenAI, AI Boxing (Containment), Machine Learning (ML), DeepMind, Acausal Trade, Scaling Laws
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 7 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 12 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 53 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 13 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2 |
| 154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9 |
| 44 | «Boundaries», Part 3b: Alignment problems in terms of boundaries | Andrew_Critch | 6d | 2 |
| 49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 29 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3 |
| 33 | My AGI safety research—2022 review, ’23 plans | Steven Byrnes | 6d | 6 |
| 26 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 15 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 132 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 67 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 31 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0 |
| 307 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 96 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10 |
| 70 | Predicting GPU performance | Marius Hobbhahn | 6d | 24 |
| 239 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 222 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |
| 223 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9 |
| 142 | Re-Examining LayerNorm | Eric Winsor | 19d | 8 |
| 96 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13 |
| 33 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2 |