Tree of Tags


71 posts: Outer Alignment, Optimization, Mesa-Optimization, Neuroscience, Neuromorphic AI, General Intelligence, Predictive Processing, AI Services (CAIS), Selection vs Control, Neocortex, Distinctions, Computing Overhang

47 posts: Inner Alignment, Solomonoff Induction, Priors, Occam's Razor

Karma · Title · Author · Age · Comments

42 · Don't align agents to evaluations of plans · TurnTrout · 24d · 46
64 · Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) · LawrenceC · 4d · 10
75 · My take on Jacob Cannell’s take on AGI safety · Steven Byrnes · 22d · 13
20 · Take 6: CAIS is actually Orwellian. · Charlie Steiner · 13d · 5
95 · What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? · johnswentworth · 4mo · 15
59 · Humans aren't fitness maximizers · So8res · 2mo · 45
71 · Human Mimicry Mainly Works When We’re Already Close · johnswentworth · 4mo · 16
6 · Inner alignment: what are we pointing at? · lcmgcd · 3mo · 2
41 · Mesa-Optimizers vs “Steered Optimizers” · Steven Byrnes · 2y · 7
89 · Bottle Caps Aren't Optimisers · DanielFilan · 4y · 21
32 · Outer alignment and imitative amplification · evhub · 2y · 11
61 · Multi-agent predictive minds and AI alignment · Jan_Kulveit · 4y · 18
85 · Risks from Learned Optimization: Conclusion and Related Work · evhub · 3y · 4
46 · The Steering Problem · paulfchristiano · 4y · 12
90 · Inner and outer alignment decompose one hard problem into two extremely hard problems · TurnTrout · 18d · 18
20 · Value Formation: An Overarching Model · Thane Ruthenis · 1mo · 6
46 · Applications for Deconfusing Goal-Directedness · adamShimi · 1y · 3
42 · Mesa-Optimizers via Grokking · orthonormal · 14d · 4
45 · Threat Model Literature Review · zac_kenton · 1mo · 4
79 · Externalized reasoning oversight: a research direction for language model alignment · tamera · 4mo · 22
33 · Framing AI Childhoods · David Udell · 3mo · 8
44 · Outer vs inner misalignment: three framings · Richard_Ngo · 5mo · 4
45 · Towards an empirical investigation of inner alignment · evhub · 3y · 9
77 · 2-D Robustness · vlad_m · 3y · 8
1 · Simplicity priors with reflective oracles · Benya_Fallenstein · 8y · 0
16 · The universal prior is malign · paulfchristiano · 6y · 0
32 · Inner alignment requires making assumptions about human values · Matthew Barnett · 2y · 9
38 · Re-Define Intent Alignment? · abramdemski · 1y · 33