Tags (620 posts): AI, Eliciting Latent Knowledge (ELK), AI Robustness, Truthful AI, Autonomy and Choice, Intelligence Explosion, Social Media, Transcripts

Tags (114 posts): Infra-Bayesianism, Counterfactuals, Interviews, Audio, Logic & Mathematics, AXRP, Redwood Research, Domain Theory, Counterfactual Mugging, Newcomb's Problem, Formal Proof, Functional Decision Theory
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 25 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 82 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53 |
| 113 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17 |
| 213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 39 | In defense of probably wrong mechanistic models | evhub | 14d | 10 |
| 64 | Verification Is Not Easier Than Generation In General | johnswentworth | 14d | 23 |
| 106 | Finding gliders in the game of life | paulfchristiano | 19d | 7 |
| 32 | Concept extrapolation for hypothesis generation | Stuart_Armstrong | 8d | 2 |
| 65 | Update to Mysteries of mode collapse: text-davinci-002 not RLHF | janus | 1mo | 8 |
| 85 | How could we know that an AGI system will have good consequences? | So8res | 1mo | 24 |
| 106 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9 |
| 98 | Infra-Bayesian physicalism: a formal theory of naturalized induction | Vanessa Kosoy | 1y | 20 |
| 118 | Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley | maxnadeau | 1mo | 14 |
| 18 | Latent Adversarial Training | Adam Jermyn | 5mo | 9 |
| 134 | Takeaways from our robust injury classifier project [Redwood Research] | dmz | 3mo | 9 |
| 12 | Counterfactuals are Confusing because of an Ontological Shift | Chris_Leong | 4mo | 35 |
| 44 | Probability as Minimal Map | johnswentworth | 3y | 10 |
| 73 | Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small | KevinRoWang | 1mo | 5 |
| 39 | Hessian and Basin volume | Vivek Hebbar | 5mo | 9 |
| 98 | High-stakes alignment via adversarial training [Redwood Research report] | dmz | 7mo | 29 |
| 121 | Introduction To The Infra-Bayesianism Sequence | Diffractor | 2y | 64 |
| 41 | Infra-Exercises, Part 1 | Diffractor | 3mo | 9 |
| 54 | Redwood's Technique-Focused Epistemic Strategy | adamShimi | 1y | 1 |
| 18 | AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler | DanielFilan | 4mo | 0 |