Tags (51 posts):
- Machine Learning (ML)
- DeepMind
- OpenAI
- Truth, Semantics, & Meaning
- Lottery Ticket Hypothesis
- Honesty
- Anthropic
- Map and Territory
- Calibration
Tags (52 posts):
- Interpretability (ML & AI)
- AI Success Models
- Conservatism (AI)
- Principal-Agent Problems
- Market making (AI safety technique)
- Empiricism
Posts (karma · title · author · age · comments):

- 59 · Reframing inner alignment · davidad · 9d · 13
- 223 · A challenge for AGI organizations, and a challenge for readers · Rob Bensinger · 19d · 30
- 199 · Common misconceptions about OpenAI · Jacob_Hilton · 3mo · 138
- 64 · Clarifying AI X-risk · zac_kenton · 1mo · 23
- 86 · Paper: Discovering novel algorithms with AlphaTensor [Deepmind] · LawrenceC · 2mo · 18
- 49 · A Data limited future · Donald Hobson · 4mo · 25
- 318 · DeepMind alignment team opinions on AGI ruin arguments · Vika · 4mo · 34
- 51 · Steganography in Chain of Thought Reasoning · A Ray · 4mo · 13
- 104 · Caution when interpreting Deepmind's In-context RL paper · Sam Marks · 1mo · 6
- 19 · Paper: In-context Reinforcement Learning with Algorithm Distillation [Deepmind] · LawrenceC · 1mo · 5
- 46 · Prosaic AI alignment · paulfchristiano · 4y · 10
- 81 · Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible · Sam Bowman · 3mo · 6
- 16 · Train first VS prune first in neural networks. · Donald Hobson · 5mo · 5
- 55 · Autonomy as taking responsibility for reference maintenance · Ramana Kumar · 4mo · 3
- 11 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
- 114 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
- 32 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
- 183 · The Plan - 2022 Update · johnswentworth · 19d · 33
- 61 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
- 51 · "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability · David Scott Krueger (formerly: capybaralet) · 1mo · 25
- 11 · Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 2
- 28 · A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · Neel Nanda · 1mo · 15
- 55 · Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? · Neel Nanda · 1mo · 14
- 25 · Toy Models and Tegum Products · Adam Jermyn · 1mo · 7
- 73 · Polysemanticity and Capacity in Neural Networks · Buck · 2mo · 9
- 69 · Engineering Monosemanticity in Toy Models · Adam Jermyn · 1mo · 6
- 19 · Subsets and quotients in interpretability · Erik Jenner · 18d · 1
- 254 · A Mechanistic Interpretability Analysis of Grokking · Neel Nanda · 4mo · 39