103 posts
Interpretability (ML & AI)
Machine Learning (ML)
DeepMind
Truth, Semantics, & Meaning
AI Success Models
OpenAI
Lottery Ticket Hypothesis
Anthropic
Conservatism (AI)
Honesty
Principal-Agent Problems
Map and Territory
50 posts
GPT
Bounties & Prizes (active)
AI-assisted Alignment
Moore's Law
Compute
Nanotechnology
List of Links
AI Safety Public Materials
Computer Science
Tripwire
Quantum Mechanics
15
An Open Agency Architecture for Safe Transformative AI
davidad
11h
11
132
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
Collin
5d
18
307
A challenge for AGI organizations, and a challenge for readers
Rob Bensinger
19d
30
239
The Plan - 2022 Update
johnswentworth
19d
33
142
Re-Examining LayerNorm
Eric Winsor
19d
8
33
Extracting and Evaluating Causal Direction in LLMs' Activations
Fabien Roger
6d
2
20
Paper: Transformers learn in-context by gradient descent
LawrenceC
4d
11
35
Reframing inner alignment
davidad
9d
13
25
[ASoT] Natural abstractions and AlphaZero
Ulisse Mini
10d
1
53
Multi-Component Learning and S-Curves
Adam Jermyn
20d
24
422
A Mechanistic Interpretability Analysis of Grokking
Neel Nanda
4mo
39
140
Clarifying AI X-risk
zac_kenton
1mo
23
410
DeepMind alignment team opinions on AGI ruin arguments
Vika
4mo
34
21
My thoughts on OpenAI's Alignment plan
Donald Hobson
10d
0
70
Predicting GPU performance
Marius Hobbhahn
6d
24
96
[Link] Why I’m optimistic about OpenAI’s alignment approach
janleike
15d
13
34
An exploration of GPT-2's embedding weights
Adam Scherlis
7d
2
62
[ASoT] Finetuning, RL, and GPT's world prior
Jozdien
18d
8
75
By Default, GPTs Think In Plain Sight
Fabien Roger
1mo
16
10
Alignment with argument-networks and assessment-predictions
Tor Økland Barstad
7d
3
20
Research request (alignment strategy): Deep dive on "making AI solve alignment for us"
JanBrauner
19d
3
14
[LINK] - ChatGPT discussion
JanBrauner
19d
7
255
New Scaling Laws for Large Language Models
1a3orn
8mo
21
175
Godzilla Strategies
johnswentworth
6mo
65
99
Beliefs and Disagreements about Automating Alignment Research
Ian McKenzie
3mo
4
36
Prizes for ML Safety Benchmark Ideas
joshc
1mo
3
84
$20K In Bounties for AI Safety Public Materials
Dan H
4mo
7
43
Recall and Regurgitation in GPT2
Megan Kinniment
2mo
1