Tree of Tags

153 posts: Interpretability (ML & AI); GPT; Machine Learning (ML); DeepMind; OpenAI; Truth, Semantics, & Meaning; AI Success Models; Bounties & Prizes (active); AI-assisted Alignment; Lottery Ticket Hypothesis; Computer Science; Honesty

168 posts: Conjecture (org); Oracle AI; Myopia; Language Models; Refine; Deconfusion; Agency; AI Boxing (Containment); Deceptive Alignment; Scaling Laws; Deception; Acausal Trade

15 An Open Agency Architecture for Safe Transformative AI

davidad

11h

11

132 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

307 A challenge for AGI organizations, and a challenge for readers

Rob Bensinger

19d

30

70 Predicting GPU performance

Marius Hobbhahn

6d

24

239 The Plan - 2022 Update

johnswentworth

19d

33

142 Re-Examining LayerNorm

Eric Winsor

19d

8

96 [Link] Why I’m optimistic about OpenAI’s alignment approach

janleike

15d

13

33 Extracting and Evaluating Causal Direction in LLMs' Activations

Fabien Roger

6d

2

20 Paper: Transformers learn in-context by gradient descent

LawrenceC

4d

11

34 An exploration of GPT-2's embedding weights

Adam Scherlis

7d

2

35 Reframing inner alignment

davidad

9d

13

62 [ASoT] Finetuning, RL, and GPT's world prior

Jozdien

18d

8

25 [ASoT] Natural abstractions and AlphaZero

Ulisse Mini

10d

1

53 Multi-Component Learning and S-Curves

Adam Jermyn

20d

24

26 Discovering Language Model Behaviors with Model-Written Evaluations

evhub

4h

3

67 Proper scoring rules don’t guarantee predicting fixed points

Johannes_Treutlein

4d

2

31 Take 11: "Aligning language models" should be weirder.

Charlie Steiner

2d

0

96 [Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey

7d

10

222 The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren

22d

27

223 Conjecture: a retrospective after 8 months of work

Connor Leahy

27d

9

759 Simulators

janus

3mo

103

248 Mysteries of mode collapse

janus

1mo

35

35 Side-channels: input versus output

davidad

8d

9

97 Searching for Search

NicholasKees

22d

6

118 Conjecture Second Hiring Round

Connor Leahy

27d

0

98 What I Learned Running Refine

adamShimi

26d

5

123 Current themes in mechanistic interpretability research

Lee Sharkey

1mo

3

494 chinchilla's wild implications

nostalgebraist

4mo

114