Reinforcement Learning (101 posts)
Related tags: AI Capabilities, Inverse Reinforcement Learning, Wireheading, Definitions, Reward Functions, Stag Hunt, Road To AI Safety Excellence, Goals, Prompt Engineering, EfficientZero, PaLM

Value Learning (63 posts)
Related tag: The Pointers Problem
Posts tagged Reinforcement Learning (score · title · author · age · comments):

252 · Reward is not the optimization target · TurnTrout · 4mo · 97
10 · Note on algorithms with multiple trained components · Steven Byrnes · 6h · 1
74 · Will we run out of ML data? Evidence from projecting dataset size trends · Pablo Villalobos · 1mo · 12
81 · When AI solves a game, focus on the game's mechanics, not its theme. · Cleo Nardo · 27d · 7
76 · Seriously, what goes wrong with "reward the agent when it makes you smile"? · TurnTrout · 4mo · 41
21 · generalized wireheading · carado · 1mo · 7
23 · What's the Most Impressive Thing That GPT-4 Could Plausibly Do? · bayesed · 3mo · 24
8 · AGIs may value intrinsic rewards more than extrinsic ones · catubc · 1mo · 6
25 · Is CIRL a promising agenda? · Chris_Leong · 6mo · 12
34 · Remaking EfficientZero (as best I can) · Hoagy · 5mo · 9
-1 · Reward IS the Optimization Target · Carn · 2mo · 3
11 · How might we make better use of AI capabilities research for alignment purposes? · ghostwheel · 3mo · 4
8 · Reinforcement Learner Wireheading · Nate Showell · 5mo · 2
Posts tagged Value Learning (score · title · author · age · comments):

276 · Is AI Progress Impossible To Predict? · alyssavance · 7mo · 38
23 · Latent Variables and Model Mis-Specification · jsteinhardt · 4y · 7
15 · Stable Pointers to Value: An Agent Embedded in Its Own Utility Function · abramdemski · 5y · 9
104 · The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · johnswentworth · 2y · 43
38 · AI Alignment Problem: “Human Values” don’t Actually Exist · avturchin · 3y · 29
50 · The easy goal inference problem is still hard · paulfchristiano · 4y · 19
56 · Humans can be assigned any values whatsoever… · Stuart_Armstrong · 4y · 26
37 · Since figuring out human values is hard, what about, say, monkey values? · shminux · 2y · 13
10 · AIs should learn human preferences, not biases · Stuart_Armstrong · 8mo · 1
34 · Human-AI Interaction · Rohin Shah · 3y · 10
19 · An Open Philanthropy grant proposal: Causal representation learning of human preferences · PabloAMC · 11mo · 6
49 · What is ambitious value learning? · Rohin Shah · 4y · 28
13 · Can few-shot learning teach AI right from wrong? · Charlie Steiner · 4y · 3
17 · Morally underdefined situations can be deadly · Stuart_Armstrong · 1y · 8
25 · Learning human preferences: black-box, white-box, and structured white-box access · Stuart_Armstrong · 2y · 9