Tags (153 posts): Interpretability (ML & AI) · GPT · Machine Learning (ML) · DeepMind · OpenAI · Truth, Semantics, & Meaning · AI Success Models · Bounties & Prizes (active) · AI-assisted Alignment · Lottery Ticket Hypothesis · Computer Science · Honesty

Tags (168 posts): Conjecture (org) · Oracle AI · Myopia · Language Models · Refine · Deconfusion · Agency · AI Boxing (Containment) · Deceptive Alignment · Scaling Laws · Deception · Acausal Trade
Karma | Title | Author | Posted | Comments
13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
47 | Reframing inner alignment | davidad | 9d | 13
123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
211 | The Plan - 2022 Update | johnswentworth | 19d | 33
265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
93 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
226 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138
47 | "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability | David Scott Krueger (formerly: capybaralet) | 1mo | 25
22 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
29 | A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) | Neel Nanda | 1mo | 15
102 | Clarifying AI X-risk | zac_kenton | 1mo | 23
26 | An exploration of GPT-2's embedding weights | Adam Scherlis | 7d | 2
27 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
80 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
35 | Side-channels: input versus output | davidad | 8d | 9
55 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
159 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
472 | Simulators | janus | 3mo | 103
37 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
142 | Decision theory does not imply that we get to have nice things | So8res | 2mo | 53
213 | Mysteries of mode collapse | janus | 1mo | 35
27 | Inverse scaling can become U-shaped | Edouard Harris | 1mo | 15
183 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9
103 | What I Learned Running Refine | adamShimi | 26d | 5
52 | Paper: Large Language Models Can Self-improve [Linkpost] | Evan R. Murphy | 2mo | 14
72 | How likely is deceptive alignment? | evhub | 3mo | 21