AI (808 posts)
Embedded Agency
Eliciting Latent Knowledge (ELK)
Reinforcement Learning
Infra-Bayesianism
Counterfactuals
Logic & Mathematics
AI Capabilities
Interviews
Audio
Subagents
Wireheading
Value Learning (126 posts)
Inverse Reinforcement Learning
Machine Intelligence Research Institute (MIRI)
Agent Foundations
Meta-Philosophy
Metaethics
Community
Philosophy
The Pointers Problem
Moral Uncertainty
Cognitive Reduction
Center for Human-Compatible AI (CHAI)
Towards Hodge-podge Alignment · Cleo Nardo · 1d · 62 karma · 20 comments
The "Minimal Latents" Approach to Natural Abstractions · johnswentworth · 22h · 37 karma · 6 comments
Note on algorithms with multiple trained components · Steven Byrnes · 7h · 10 karma · 1 comment
Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. · Charlie Steiner · 19h · 21 karma · 0 comments
AI alignment is distinct from its near-term applications · paulfchristiano · 7d · 232 karma · 5 comments
Can we efficiently explain model behaviors? · paulfchristiano · 4d · 63 karma · 0 comments
Trying to disambiguate different questions about whether RLHF is "good" · Buck · 6d · 92 karma · 39 comments
Using GPT-Eliezer against ChatGPT Jailbreaking · Stuart_Armstrong · 14d · 159 karma · 77 comments
High-level hopes for AI alignment · HoldenKarnofsky · 5d · 42 karma · 3 comments
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · LawrenceC · 17d · 130 karma · 9 comments
Existential AI Safety is NOT separate from near-term applications · scasper · 7d · 37 karma · 15 comments
Finding gliders in the game of life · paulfchristiano · 19d · 91 karma · 7 comments
Mechanistic anomaly detection and ELK · paulfchristiano · 25d · 121 karma · 17 comments
Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. · Charlie Steiner · 7d · 30 karma · 3 comments
Event [Berkeley]: Alignment Collaborator Speed-Meeting · AlexMennen · 1d · 18 karma · 2 comments
Looking for an alignment tutor · JanBrauner · 3d · 15 karma · 2 comments
Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility · Akash · 28d · 69 karma · 20 comments
Don't design agents which exploit adversarial inputs · TurnTrout · 1mo · 60 karma · 61 comments
Beyond Kolmogorov and Shannon · Alexander Gietelink Oldenziel · 1mo · 60 karma · 14 comments
People care about each other even though they have imperfect motivational pointers? · TurnTrout · 1mo · 32 karma · 25 comments
Why Agent Foundations? An Overly Abstract Explanation · johnswentworth · 9mo · 247 karma · 54 comments
A newcomer's guide to the technical AI safety field · zeshen · 1mo · 30 karma · 1 comment
Prize and fast track to alignment research at ALTER · Vanessa Kosoy · 3mo · 65 karma · 4 comments
Clarifying the Agent-Like Structure Problem · johnswentworth · 2mo · 53 karma · 14 comments
AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022 · Sam Bowman · 3mo · 74 karma · 2 comments
Announcing the Introduction to ML Safety course · Dan H · 4mo · 69 karma · 6 comments
Seeking Interns/RAs for Mechanistic Interpretability Projects · Neel Nanda · 4mo · 61 karma · 0 comments
Encultured AI Pre-planning, Part 1: Enabling New Benchmarks · Andrew_Critch · 4mo · 62 karma · 2 comments