AI (734 posts): Eliciting Latent Knowledge (ELK), Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews, Audio, AXRP, Redwood Research, Transcripts, Formal Proof, Domain Theory

Embedded Agency (74 posts): Reinforcement Learning, Subagents, Reward Functions, EfficientZero, Robust Agents, Wireheading, AI Capabilities, Spurious Counterfactuals, Category Theory, Memetics, Tradeoffs
| Score | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 72 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53 |
| 129 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17 |
| 154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9 |
| 251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 43 | In defense of probably wrong mechanistic models | evhub | 14d | 10 |
| 48 | Verification Is Not Easier Than Generation In General | johnswentworth | 14d | 23 |
| 76 | Finding gliders in the game of life | paulfchristiano | 19d | 7 |
| 8 | Concept extrapolation for hypothesis generation | Stuart_Armstrong | 8d | 2 |
| 73 | Update to Mysteries of mode collapse: text-davinci-002 not RLHF | janus | 1mo | 8 |
| 271 | Reward is not the optimization target | TurnTrout | 4mo | 97 |
| 42 | Four usages of "loss" in AI | TurnTrout | 2mo | 18 |
| 107 | Evaluations project @ ARC is hiring a researcher and a webdev/engineer | Beth Barnes | 3mo | 7 |
| 72 | Seriously, what goes wrong with "reward the agent when it makes you smile"? | TurnTrout | 4mo | 41 |
| 89 | Scaling Laws for Reward Model Overoptimization | leogao | 2mo | 11 |
| 43 | Gradations of Agency | Daniel Kokotajlo | 7mo | 6 |
| 32 | It matters when the first sharp left turn happens | Adam Jermyn | 2mo | 9 |
| 61 | Towards deconfusing wireheading and reward maximization | leogao | 3mo | 7 |
| 26 | Conditioning, Prompts, and Fine-Tuning | Adam Jermyn | 4mo | 9 |
| 188 | Why Subagents? | johnswentworth | 3y | 42 |
| 237 | Humans are very reliable agents | alyssavance | 6mo | 35 |
| 23 | The reward engineering problem | paulfchristiano | 3y | 3 |
| 25 | Committing, Assuming, Externalizing, and Internalizing | Scott Garrabrant | 2y | 25 |
| 29 | Eight Definitions of Observability | Scott Garrabrant | 2y | 26 |