AI (808 posts): Embedded Agency, Eliciting Latent Knowledge (ELK), Reinforcement Learning, Infra-Bayesianism, Counterfactuals, Logic & Mathematics, AI Capabilities, Interviews, Audio, Subagents, Wireheading
Value Learning (126 posts): Inverse Reinforcement Learning, Machine Intelligence Research Institute (MIRI), Agent Foundations, Meta-Philosophy, Metaethics, Community, Philosophy, The Pointers Problem, Moral Uncertainty, Cognitive Reduction, Center for Human-Compatible AI (CHAI)
| Karma | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 79 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20 |
| 39 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6 |
| 251 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5 |
| 7 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1 |
| 12 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0 |
| 53 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0 |
| 85 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39 |
| 182 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77 |
| 154 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9 |
| 49 | Existential AI Safety is NOT separate from near-term applications | scasper | 7d | 15 |
| 29 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3 |
| 129 | Mechanistic anomaly detection and ELK | paulfchristiano | 25d | 17 |
| 76 | Finding gliders in the game of life | paulfchristiano | 19d | 7 |
| 23 | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. | Charlie Steiner | 7d | 3 |
| 13 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2 |
| 13 | Looking for an alignment tutor | JanBrauner | 3d | 2 |
| 87 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20 |
| 67 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61 |
| 80 | Beyond Kolmogorov and Shannon | Alexander Gietelink Oldenziel | 1mo | 14 |
| 297 | Why Agent Foundations? An Overly Abstract Explanation | johnswentworth | 9mo | 54 |
| 85 | Prize and fast track to alignment research at ALTER | Vanessa Kosoy | 3mo | 4 |
| 35 | A newcomer’s guide to the technical AI safety field | zeshen | 1mo | 1 |
| 91 | AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022 | Sam Bowman | 3mo | 2 |
| 61 | Clarifying the Agent-Like Structure Problem | johnswentworth | 2mo | 14 |
| 99 | Announcing the Introduction to ML Safety course | Dan H | 4mo | 6 |
| 25 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25 |
| 20 | The Slippery Slope from DALLE-2 to Deepfake Anarchy | scasper | 1mo | 9 |
| 63 | Seeking Interns/RAs for Mechanistic Interpretability Projects | Neel Nanda | 4mo | 0 |