Branch 1 (153 posts): Interpretability (ML & AI); GPT; Machine Learning (ML); DeepMind; OpenAI; Truth, Semantics, & Meaning; AI Success Models; Bounties & Prizes (active); AI-assisted Alignment; Lottery Ticket Hypothesis; Computer Science; Honesty
Branch 2 (168 posts): Conjecture (org); Oracle AI; Myopia; Language Models; Refine; Deconfusion; Agency; AI Boxing (Containment); Deceptive Alignment; Scaling Laws; Deception; Acausal Trade

Karma | Title | Author | Posted | Comments
15 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11
35 | Reframing inner alignment | davidad | 9d | 13
132 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18
20 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11
239 | The Plan - 2022 Update | johnswentworth | 19d | 33
307 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30
53 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24
96 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13
253 | Common misconceptions about OpenAI | Jacob_Hilton | 3mo | 138
43 | "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability | David Scott Krueger (formerly: capybaralet) | 1mo | 25
33 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2
30 | A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) | Neel Nanda | 1mo | 15
140 | Clarifying AI X-risk | zac_kenton | 1mo | 23
34 | An exploration of GPT-2's embedding weights | Adam Scherlis | 7d | 2
26 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3
96 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10
35 | Side-channels: input versus output | davidad | 8d | 9
67 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2
222 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27
759 | Simulators | janus | 3mo | 103
33 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16
139 | Decision theory does not imply that we get to have nice things | So8res | 2mo | 53
248 | Mysteries of mode collapse | janus | 1mo | 35
26 | Inverse scaling can become U-shaped | Edouard Harris | 1mo | 15
223 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9
98 | What I Learned Running Refine | adamShimi | 26d | 5
64 | Paper: Large Language Models Can Self-improve [Linkpost] | Evan R. Murphy | 2mo | 14
80 | How likely is deceptive alignment? | evhub | 3mo | 21