AI (1014 posts): AI Timelines, Value Learning, AI Takeoff, Embedded Agency, Community, Eliciting Latent Knowledge (ELK), Reinforcement Learning, Infra-Bayesianism, Counterfactuals, Logic & Mathematics, Interviews

Iterated Amplification (111 posts): Game Theory, Factored Cognition, Humans Consulting HCH, Research Agendas, Ought, Debate (AI safety technique), Risks of Astronomical Suffering (S-risks), Center on Long-Term Risk (CLR), Mechanism Design, Fairness, Group Rationality
Score | Title                                                                                            | Author           | Posted | Comments
------+--------------------------------------------------------------------------------------------------+------------------+--------+---------
13    | Note on algorithms with multiple trained components                                              | Steven Byrnes    | 7h     | 1
45    | Towards Hodge-podge Alignment                                                                    | Cleo Nardo       | 1d     | 20
30    | Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems.                    | Charlie Steiner  | 19h    | 0
35    | The "Minimal Latents" Approach to Natural Abstractions                                           | johnswentworth   | 22h    | 6
213   | AI alignment is distinct from its near-term applications                                         | paulfchristiano  | 7d     | 5
73    | Can we efficiently explain model behaviors?                                                      | paulfchristiano  | 4d     | 0
99    | Trying to disambiguate different questions about whether RLHF is “good”                          | Buck             | 6d     | 39
23    | Event [Berkeley]: Alignment Collaborator Speed-Meeting                                           | AlexMennen       | 1d     | 2
55    | High-level hopes for AI alignment                                                                | HoldenKarnofsky  | 5d     | 3
136   | Using GPT-Eliezer against ChatGPT Jailbreaking                                                   | Stuart_Armstrong | 14d    | 77
17    | Looking for an alignment tutor                                                                   | JanBrauner       | 3d     | 2
106   | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] | LawrenceC        | 17d    | 9
106   | Finding gliders in the game of life                                                              | paulfchristiano  | 19d    | 7
37    | Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.                                    | Charlie Steiner  | 7d     | 3
54    | «Boundaries», Part 3b: Alignment problems in terms of boundaries                                 | Andrew_Critch    | 6d     | 2
35    | My AGI safety research—2022 review, ’23 plans                                                    | Steven Byrnes    | 6d     | 6
47    | Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.                                       | Charlie Steiner  | 8d     | 14
52    | Notes on OpenAI’s alignment plan                                                                 | Alex Flint       | 12d    | 5
231   | On how various plans miss the hard bits of the alignment challenge                               | So8res           | 5mo    | 81
155   | Some conceptual alignment research projects                                                      | Richard_Ngo      | 3mo    | 14
204   | Unifying Bargaining Notions (1/2)                                                                | Diffractor       | 4mo    | 38
98    | Threat-Resistant Bargaining Megapost: Introducing the ROSE Value                                 | Diffractor       | 2mo    | 11
16    | Theories of impact for Science of Deep Learning                                                  | Marius Hobbhahn  | 19d    | 0
155   | «Boundaries», Part 1: a key missing concept from utility theory                                  | Andrew_Critch    | 4mo    | 26
120   | Unifying Bargaining Notions (2/2)                                                                | Diffractor       | 4mo    | 11
85    | Rant on Problem Factorization for Alignment                                                      | johnswentworth   | 4mo    | 48
122   | Supervise Process, not Outcomes                                                                  | stuhlmueller     | 8mo    | 8
31    | A Library and Tutorial for Factored Cognition with Language Models                               | stuhlmueller     | 2mo    | 0