180 posts: Research Agendas, Embedded Agency, Suffering, Agency, Animal Welfare, Risks of Astronomical Suffering (S-risks), Robust Agents, Cause Prioritization, Center on Long-Term Risk (CLR), 80,000 Hours, Crucial Considerations, Veg*nism
164 posts: Value Learning, Reinforcement Learning, AI Capabilities, Inverse Reinforcement Learning, Wireheading, Definitions, Reward Functions, The Pointers Problem, Stag Hunt, Road To AI Safety Excellence, Goals, EfficientZero
Karma | Title | Author | Posted | Comments
34 | My AGI safety research—2022 review, '23 plans | Steven Byrnes | 6d | 6
16 | Riffing on the agent type | Quinn | 12d | 0
258 | On how various plans miss the hard bits of the alignment challenge | So8res | 5mo | 81
168 | Some conceptual alignment research projects | Richard_Ngo | 3mo | 14
60 | New book on s-risks | Tobias_Baumann | 1mo | 1
248 | Humans are very reliable agents | alyssavance | 6mo | 35
21 | LLMs may capture key components of human agency | catubc | 1mo | 0
11 | Sets of objectives for a multi-objective RL agent to optimize | Ben Smith | 27d | 0
12 | Interpreting systems as solving POMDPs: a step towards a formal understanding of agency [paper link] | the gears to ascenscion | 1mo | 2
15 | Distilled Representations Research Agenda | Hoagy | 2mo | 2
14 | Cooperators are more powerful than agents | Ivan Vendrov | 2mo | 7
7 | The two conceptions of Active Inference: an intelligence architecture and a theory of agency | Roman Leventov | 1mo | 0
49 | Eliciting Latent Knowledge (ELK) - Distillation/Summary | Marius Hobbhahn | 6mo | 2
40 | Gradations of Agency | Daniel Kokotajlo | 7mo | 6
10 | Note on algorithms with multiple trained components | Steven Byrnes | 6h | 1
81 | When AI solves a game, focus on the game's mechanics, not its theme. | Cleo Nardo | 27d | 7
74 | Will we run out of ML data? Evidence from projecting dataset size trends | Pablo Villalobos | 1mo | 12
252 | Reward is not the optimization target | TurnTrout | 4mo | 97
40 | A Short Dialogue on the Meaning of Reward Functions | Leon Lang | 1mo | 0
276 | Is AI Progress Impossible To Predict? | alyssavance | 7mo | 38
21 | generalized wireheading | carado | 1mo | 7
273 | EfficientZero: How It Works | 1a3orn | 1y | 42
76 | Seriously, what goes wrong with "reward the agent when it makes you smile"? | TurnTrout | 4mo | 41
6 | Can GPT-3 Write Contra Dances? | jefftk | 16d | 0
6 | Mastering Stratego (Deepmind) | svemirski | 18d | 0
8 | AGIs may value intrinsic rewards more than extrinsic ones | catubc | 1mo | 6
134 | EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised | gwern | 1y | 52
22 | Character alignment | p.b. | 3mo | 0