Tag branch 1 (153 posts): Interpretability (ML & AI) · GPT · Machine Learning (ML) · DeepMind · OpenAI · Truth, Semantics, & Meaning · AI Success Models · Bounties & Prizes (active) · AI-assisted Alignment · Lottery Ticket Hypothesis · Computer Science · Honesty

Tag branch 2 (168 posts): Conjecture (org) · Oracle AI · Myopia · Language Models · Refine · Deconfusion · Agency · AI Boxing (Containment) · Deceptive Alignment · Scaling Laws · Deception · Acausal Trade
Posts (title · author · posted · points · comments):

- An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 13 · 11
- How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 123 · 18
- A challenge for AGI organizations, and a challenge for readers · Rob Bensinger · 19d · 265 · 30
- The Plan - 2022 Update · johnswentworth · 19d · 211 · 33
- Predicting GPU performance · Marius Hobbhahn · 6d · 59 · 24
- Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 26 · 11
- [Link] Why I'm optimistic about OpenAI's alignment approach · janleike · 15d · 93 · 13
- Reframing inner alignment · davidad · 9d · 47 · 13
- Re-Examining LayerNorm · Eric Winsor · 19d · 99 · 8
- Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 22 · 2
- An exploration of GPT-2's embedding weights · Adam Scherlis · 7d · 26 · 2
- [ASoT] Natural abstractions and AlphaZero · Ulisse Mini · 10d · 31 · 1
- Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 57 · 24
- DeepMind alignment team opinions on AGI ruin arguments · Vika · 4mo · 364 · 34
- Discovering Language Model Behaviors with Model-Written Evaluations · evhub · 4h · 27 · 3
- Proper scoring rules don't guarantee predicting fixed points · Johannes_Treutlein · 4d · 55 · 2
- Take 11: "Aligning language models" should be weirder. · Charlie Steiner · 2d · 29 · 0
- [Interim research report] Taking features out of superposition with sparse autoencoders · Lee Sharkey · 7d · 80 · 10
- The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · beren · 22d · 159 · 27
- Conjecture: a retrospective after 8 months of work · Connor Leahy · 27d · 183 · 9
- Side-channels: input versus output · davidad · 8d · 35 · 9
- Mysteries of mode collapse · janus · 1mo · 213 · 35
- What I Learned Running Refine · adamShimi · 26d · 103 · 5
- Simulators · janus · 3mo · 472 · 103
- Conjecture Second Hiring Round · Connor Leahy · 27d · 85 · 0
- Searching for Search · NicholasKees · 22d · 64 · 6
- Steering Behaviour: Testing for (Non-)Myopia in Language Models · Evan R. Murphy · 15d · 37 · 16
- Current themes in mechanistic interpretability research · Lee Sharkey · 1mo · 82 · 3