Tree of Tags

123 posts: Interpretability (ML & AI), Myopia, Self Fulfilling/Refuting Prophecies, AI Success Models, Lottery Ticket Hypothesis, Conservatism (AI), Luck, Market making (AI safety technique), Verification

123 posts: Instrumental Convergence, Corrigibility, Deconfusion, Orthogonality Thesis, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization, Satisficer, Gradient Descent, Tripwire

230 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

174 Self-fulfilling correlations

PhilGoetz

12y

50

172 The Plan - 2022 Update

johnswentworth

19d

33

158 MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Rob Bensinger

1y

13

149 A transparency and interpretability tech tree

evhub

6mo

10

138 Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner

1y

16

130 Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth

6mo

52

111 Conversation with Eliezer: What do you want the system to do?

Akash

5mo

38

108 The case for becoming a black-box investigator of language models

Buck

7mo

19

106 Understanding “Deep Double Descent”

evhub

3y

51

106 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

5d

18

103 The Credit Assignment Problem

abramdemski

3y

40

97 Search versus design

Alex Flint

2y

41

97 Gradations of Inner Alignment Obstacles

abramdemski

1y

22

171 Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben Pace

3y

60

134 Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res

1y

35

132 Sorting Pebbles Into Correct Heaps

Eliezer Yudkowsky

14y

109

127 Let's See You Write That Corrigibility Tag

Eliezer Yudkowsky

6mo

67

125 Goal retention discussion with Eliezer

MaxTegmark

8y

26

113 A broad basin of attraction around human values?

Wei_Dai

8mo

16

112 Seeking Power is Often Convergently Instrumental in MDPs

TurnTrout

3y

38

108 Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth

4mo

8

98 Coherence arguments imply a force for goal-directed behavior

KatjaGrace

1y

27

97 Gradient hacking

evhub

3y

39

77 Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

TurnTrout

1y

8

73 Corrigibility Can Be VNM-Incoherent

TurnTrout

1y

24

73 Introducing Corrigibility (an FAI research subfield)

So8res

8y

28

72 Boeing 737 MAX MCAS as an agent corrigibility failure

shminux

3y

3