153 posts — tags: Interpretability (ML & AI); GPT; Machine Learning (ML); DeepMind; OpenAI; Truth, Semantics, & Meaning; AI Success Models; Bounties & Prizes (active); AI-assisted Alignment; Lottery Ticket Hypothesis; Computer Science; Honesty

168 posts — tags: Conjecture (org); Oracle AI; Myopia; Language Models; Refine; Deconfusion; Agency; AI Boxing (Containment); Deceptive Alignment; Scaling Laws; Deception; Acausal Trade
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 11 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 114 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 223 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 183 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 48 | Predicting GPU performance | Marius Hobbhahn | 6d | 24 |
| 32 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 59 | Reframing inner alignment | davidad | 9d | 13 |
| 90 | [Link] Why I’m optimistic about OpenAI’s alignment approach | janleike | 15d | 13 |
| 37 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1 |
| 56 | Re-Examining LayerNorm | Eric Winsor | 19d | 8 |
| 61 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 18 | An exploration of GPT-2's embedding weights | Adam Scherlis | 7d | 2 |
| 11 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2 |
| 19 | My thoughts on OpenAI's Alignment plan | Donald Hobson | 10d | 0 |
| 28 | Discovering Language Model Behaviors with Model-Written Evaluations | evhub | 4h | 3 |
| 27 | Take 11: "Aligning language models" should be weirder. | Charlie Steiner | 2d | 0 |
| 43 | Proper scoring rules don’t guarantee predicting fixed points | Johannes_Treutlein | 4d | 2 |
| 64 | [Interim research report] Taking features out of superposition with sparse autoencoders | Lee Sharkey | 7d | 10 |
| 143 | Conjecture: a retrospective after 8 months of work | Connor Leahy | 27d | 9 |
| 35 | Side-channels: input versus output | davidad | 8d | 9 |
| 96 | The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable | beren | 22d | 27 |
| 108 | What I Learned Running Refine | adamShimi | 26d | 5 |
| 178 | Mysteries of mode collapse | janus | 1mo | 35 |
| 41 | Steering Behaviour: Testing for (Non-)Myopia in Language Models | Evan R. Murphy | 15d | 16 |
| 145 | Decision theory does not imply that we get to have nice things | So8res | 2mo | 53 |
| 85 | Trying to Make a Treacherous Mesa-Optimizer | MadHatter | 1mo | 13 |
| 52 | Conjecture Second Hiring Round | Connor Leahy | 27d | 0 |
| 31 | Searching for Search | NicholasKees | 22d | 6 |