Tree of Tags

246 posts: Interpretability (ML & AI), Instrumental Convergence, Corrigibility, Myopia, Deconfusion, Self Fulfilling/Refuting Prophecies, Orthogonality Thesis, AI Success Models, Treacherous Turn, Gradient Hacking, Mild Optimization, Quantilization

112 posts: Iterated Amplification, Debate (AI safety technique), Factored Cognition, Humans Consulting HCH, Experiments, Ought, AI-assisted Alignment, Memory and Mnemonics, Air Conditioning, Verification

230 A Mechanistic Interpretability Analysis of Grokking

Neel Nanda

4mo

39

174 Self-fulfilling correlations

PhilGoetz

12y

50

172 The Plan - 2022 Update

johnswentworth

19d

33

171 Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Ben Pace

3y

60

158 MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models"

Rob Bensinger

1y

13

149 A transparency and interpretability tech tree

evhub

6mo

10

138 Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner

1y

16

134 Soares, Tallinn, and Yudkowsky discuss AGI cognition

So8res

1y

35

132 Sorting Pebbles Into Correct Heaps

Eliezer Yudkowsky

14y

109

130 Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth

6mo

52

127 Let's See You Write That Corrigibility Tag

Eliezer Yudkowsky

6mo

67

125 Goal retention discussion with Eliezer

MaxTegmark

8y

26

113 A broad basin of attraction around human values?

Wei_Dai

8mo

16

112 Seeking Power is Often Convergently Instrumental in MDPs

TurnTrout

3y

38

132 Debate update: Obfuscated arguments problem

Beth Barnes

1y

21

118 Godzilla Strategies

johnswentworth

6mo

65

117 My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

Chi Nguyen

2y

21

116 Supervise Process, not Outcomes

stuhlmueller

8mo

8

111 Paul's research agenda FAQ

zhukeepa

4y

73

107 Solving Math Problems by Relay

bgold

2y

26

106 Preregistration: Air Conditioner Test

johnswentworth

8mo

64

102 Imitative Generalisation (AKA 'Learning the Prior')

Beth Barnes

1y

14

91 Writeup: Progress on AI Safety via Debate

Beth Barnes

2y

18

85 Model splintering: moving from one imperfect model to another

Stuart_Armstrong

2y

10

84 Rant on Problem Factorization for Alignment

johnswentworth

4mo

48

80 Air Conditioner Test Results & Discussion

johnswentworth

6mo

38

79 Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie

3mo

4

77 Why I'm excited about Debate

Richard_Ngo

1y

12