Tags (103 posts): Interpretability (ML & AI); Machine Learning (ML); DeepMind; Truth, Semantics, & Meaning; AI Success Models; OpenAI; Lottery Ticket Hypothesis; Anthropic; Conservatism (AI); Honesty; Principal-Agent Problems; Map and Territory

Tags (50 posts): GPT; Bounties & Prizes (active); AI-assisted Alignment; Moore's Law; Compute; Nanotechnology; List of Links; AI Safety Public Materials; Computer Science; Tripwire; Quantum Mechanics
Karma · Title · Author · Posted · Comments

11 · An Open Agency Architecture for Safe Transformative AI · davidad · 11h · 11
59 · Reframing inner alignment · davidad · 9d · 13
114 · How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · Collin · 5d · 18
32 · Paper: Transformers learn in-context by gradient descent · LawrenceC · 4d · 11
183 · The Plan - 2022 Update · johnswentworth · 19d · 33
223 · A challenge for AGI organizations, and a challenge for readers · Rob Bensinger · 19d · 30
61 · Multi-Component Learning and S-Curves · Adam Jermyn · 20d · 24
199 · Common misconceptions about OpenAI · Jacob_Hilton · 3mo · 138
51 · "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability · David Scott Krueger (formerly: capybaralet) · 1mo · 25
11 · Extracting and Evaluating Causal Direction in LLMs' Activations · Fabien Roger · 6d · 2
28 · A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien) · Neel Nanda · 1mo · 15
64 · Clarifying AI X-risk · zac_kenton · 1mo · 23
55 · Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? · Neel Nanda · 1mo · 14
25 · Toy Models and Tegum Products · Adam Jermyn · 1mo · 7
90 · [Link] Why I'm optimistic about OpenAI's alignment approach · janleike · 15d · 13
18 · An exploration of GPT-2's embedding weights · Adam Scherlis · 7d · 2
12 · Research request (alignment strategy): Deep dive on "making AI solve alignment for us" · JanBrauner · 19d · 3
12 · [LINK] - ChatGPT discussion · JanBrauner · 19d · 7
8 · Distribution Shifts and The Importance of AI Safety · Leon Lang · 2mo · 2
5 · AI-assisted list of ten concrete alignment things to do right now · lcmgcd · 3mo · 5
68 · NeurIPS ML Safety Workshop 2022 · Dan H · 4mo · 2
22 · [$20K in Prizes] AI Safety Arguments Competition · Dan H · 7mo · 543
125 · Developmental Stages of GPTs · orthonormal · 2y · 74
0 · New(ish) AI control ideas · Stuart_Armstrong · 5y · 0
3 · Corrigibility thoughts I: caring about multiple things · Stuart_Armstrong · 5y · 0
96 · Collection of GPT-3 results · Kaj_Sotala · 2y · 24
145 · interpreting GPT: the logit lens · nostalgebraist · 2y · 32
164 · MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" · Rob Bensinger · 1y · 13