Tags (734 posts): AI, Eliciting Latent Knowledge (ELK), Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews, Audio, AXRP, Redwood Research, Transcripts, Formal Proof, Domain Theory
Tags (74 posts): Embedded Agency, Reinforcement Learning, Subagents, Reward Functions, EfficientZero, Robust Agents, Wireheading, AI Capabilities, Spurious Counterfactuals, Category Theory, Memetics, Tradeoffs
Karma | Title | Author | Posted | Comments
79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
12 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
53 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15
29 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
129 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17
76 | Finding gliders in the game of life | paulfchristiano | 19d | 7
23 | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. | Charlie Steiner | 7d | 3
503 | (My understanding of) What Everyone in Technical Alignment is Doing and Why | Thomas Larsen | 3mo | 83
7 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
104 | Will we run out of ML data? Evidence from projecting dataset size trends | Pablo Villalobos | 1mo | 12
271 | Reward is not the optimization target | TurnTrout | 4mo | 97
43 | A Short Dialogue on the Meaning of Reward Functions | Leon Lang | 1mo | 0
89 | Scaling Laws for Reward Model Overoptimization | leogao | 2mo | 11
237 | Humans are very reliable agents | alyssavance | 6mo | 35
107 | Evaluations project @ ARC is hiring a researcher and a webdev/engineer | Beth Barnes | 3mo | 7
334 | EfficientZero: How It Works | 1a3orn | 1y | 42
61 | Towards deconfusing wireheading and reward maximization | leogao | 3mo | 7
42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18
72 | Seriously, what goes wrong with "reward the agent when it makes you smile"? | TurnTrout | 4mo | 41
120 | We have achieved Noob Gains in AI | phdead | 7mo | 21
32 | It matters when the first sharp left turn happens | Adam Jermyn | 2mo | 9
124 | EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised | gwern | 1y | 52