2595 posts
AI
AI Timelines
AI Takeoff
Interpretability (ML & AI)
Careers
Instrumental Convergence
Iterated Amplification
Corrigibility
Audio
Debate (AI safety technique)
Infra-Bayesianism
DeepMind
488 posts
GPT
Conjecture (org)
Art
Music
Machine Learning (ML)
Bounties & Prizes (active)
OpenAI
QURI
Language Models
Project Announcement
DALL-E
Meta-Humor
7
Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic
Akash
2h
0
29
Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems.
Charlie Steiner
19h
0
33
The "Minimal Latents" Approach to Natural Abstractions
johnswentworth
22h
6
40
Towards Hodge-podge Alignment
Cleo Nardo
1d
20
10
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
199
AI alignment is distinct from its near-term applications
paulfchristiano
7d
5
106
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
108
The next decades might be wild
Marius Hobbhahn
5d
21
70
Can we efficiently explain model behaviors?
paulfchristiano
4d
0
15
Solution to The Alignment Problem
Algon
1d
0
95
Trying to disambiguate different questions about whether RLHF is “good”
Buck
6d
39
22
Event [Berkeley]: Alignment Collaborator Speed-Meeting
AlexMennen
1d
2
54
High-level hopes for AI alignment
HoldenKarnofsky
5d
3
39
Proper scoring rules don’t guarantee predicting fixed points
Johannes_Treutlein
4d
2
27
Discovering Language Model Behaviors with Model-Written Evaluations
evhub
4h
3
43
Next Level Seinfeld
Zvi
1d
6
70
Bad at Arithmetic, Promising at Math
cohenmacaulay
2d
17
26
Take 11: "Aligning language models" should be weirder.
Charlie Steiner
2d
0
11
Does ChatGPT’s performance warrant working on a tutor for children? [It’s time to take it to the lab.]
Bill Benzon
1d
2
59
[Interim research report] Taking features out of superposition with sparse autoencoders
Lee Sharkey
7d
10
160
Jailbreaking ChatGPT on Release Day
Zvi
18d
74
55
A brainteaser for language models
Adam Scherlis
8d
3
38
Discovering Latent Knowledge in Language Models Without Supervision
Xodarap
6d
1
57
Reframing inner alignment
davidad
9d
13
98
Did ChatGPT just gaslight me?
ThomasW
19d
45
132
Conjecture: a retrospective after 8 months of work
Connor Leahy
27d
9
3
Will research in AI risk jinx it? Consequences of training AI on AI risk arguments
Yann Dubois
1d
6
103
What I Learned Running Refine
adamShimi
26d
5