620 posts: AI, Eliciting Latent Knowledge (ELK), AI Robustness, Truthful AI, Autonomy and Choice, Intelligence Explosion, Social Media, Transcripts
114 posts: Infra-Bayesianism, Counterfactuals, Interviews, Audio, Logic & Mathematics, AXRP, Redwood Research, Domain Theory, Counterfactual Mugging, Newcomb's Problem, Formal Proof, Functional Decision Theory
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 72 | A shot at the diamond-alignment problem | TurnTrout | 2mo | 53 |
| 129 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17 |
| 251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 43 | In defense of probably wrong mechanistic models | evhub | 14d | 10 |
| 48 | Verification Is Not Easier Than Generation In General | johnswentworth | 14d | 23 |
| 76 | Finding gliders in the game of life | paulfchristiano | 19d | 7 |
| 8 | Concept extrapolation for hypothesis generation | Stuart_Armstrong | 8d | 2 |
| 73 | Update to Mysteries of mode collapse: text-davinci-002 not RLHF | janus | 1mo | 8 |
| 87 | How could we know that an AGI system will have good consequences? | So8res | 1mo | 24 |
| 154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9 |
| 98 | Infra-Bayesian physicalism: a formal theory of naturalized induction | Vanessa Kosoy | 1y | 20 |
| 150 | Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley | maxnadeau | 1mo | 14 |
| 30 | Latent Adversarial Training | Adam Jermyn | 5mo | 9 |
| 136 | Takeaways from our robust injury classifier project [Redwood Research] | dmz | 3mo | 9 |
| 22 | Counterfactuals are Confusing because of an Ontological Shift | Chris_Leong | 4mo | 35 |
| 54 | Probability as Minimal Map | johnswentworth | 3y | 10 |
| 99 | Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small | KevinRoWang | 1mo | 5 |
| 27 | Hessian and Basin volume | Vivek Hebbar | 5mo | 9 |
| 174 | High-stakes alignment via adversarial training [Redwood Research report] | dmz | 7mo | 29 |
| 87 | Introduction To The Infra-Bayesianism Sequence | Diffractor | 2y | 64 |
| 57 | Infra-Exercises, Part 1 | Diffractor | 3mo | 9 |
| 42 | Redwood's Technique-Focused Epistemic Strategy | adamShimi | 1y | 1 |
| 14 | AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler | DanielFilan | 4mo | 0 |