AI (808 posts)
Related tags: Embedded Agency, Eliciting Latent Knowledge (ELK), Reinforcement Learning, Infra-Bayesianism, Counterfactuals, Logic & Mathematics, AI Capabilities, Interviews, Audio, Subagents, Wireheading
Value Learning (126 posts)
Related tags: Inverse Reinforcement Learning, Machine Intelligence Research Institute (MIRI), Agent Foundations, Meta-Philosophy, Metaethics, Community, Philosophy, The Pointers Problem, Moral Uncertainty, Cognitive Reduction, Center for Human-Compatible AI (CHAI)
Karma | Title | Author | Posted | Comments
13 | Note on algorithms with multiple trained components | Steven Byrnes | 7h | 1
45 | Towards Hodge-podge Alignment | Cleo Nardo | 1d | 20
30 | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. | Charlie Steiner | 19h | 0
35 | The "Minimal Latents" Approach to Natural Abstractions | johnswentworth | 22h | 6
213 | AI alignment is distinct from its near-term applications | paulfchristiano | 7d | 5
73 | Can we efficiently explain model behaviors? | paulfchristiano | 4d | 0
99 | Trying to disambiguate different questions about whether RLHF is “good” | Buck | 6d | 39
55 | High-level hopes for AI alignment | HoldenKarnofsky | 5d | 3
136 | Using GPT-Eliezer against ChatGPT Jailbreaking | Stuart_Armstrong | 14d | 77
106 | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC | 17d | 9
106 | Finding gliders in the game of life | paulfchristiano | 19d | 7
37 | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. | Charlie Steiner | 7d | 3
64 | Verification Is Not Easier Than Generation In General | johnswentworth | 14d | 23
32 | Concept extrapolation for hypothesis generation | Stuart_Armstrong | 8d | 2
23 | Event [Berkeley]: Alignment Collaborator Speed-Meeting | AlexMennen | 1d | 2
17 | Looking for an alignment tutor | JanBrauner | 3d | 2
51 | Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility | Akash | 28d | 20
53 | Don't design agents which exploit adversarial inputs | TurnTrout | 1mo | 61
39 | People care about each other even though they have imperfect motivational pointers? | TurnTrout | 1mo | 25
40 | Beyond Kolmogorov and Shannon | Alexander Gietelink Oldenziel | 1mo | 14
197 | Why Agent Foundations? An Overly Abstract Explanation | johnswentworth | 9mo | 54
25 | A newcomer’s guide to the technical AI safety field | zeshen | 1mo | 1
45 | Clarifying the Agent-Like Structure Problem | johnswentworth | 2mo | 14
57 | AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022 | Sam Bowman | 3mo | 2
70 | Encultured AI Pre-planning, Part 1: Enabling New Benchmarks | Andrew_Critch | 4mo | 2
45 | Prize and fast track to alignment research at ALTER | Vanessa Kosoy | 3mo | 4
59 | Seeking Interns/RAs for Mechanistic Interpretability Projects | Neel Nanda | 4mo | 0
12 | The Slippery Slope from DALLE-2 to Deepfake Anarchy | scasper | 1mo | 9