Tags (103 posts):
- Interpretability (ML & AI)
- Machine Learning (ML)
- DeepMind
- Truth, Semantics, & Meaning
- AI Success Models
- OpenAI
- Lottery Ticket Hypothesis
- Anthropic
- Conservatism (AI)
- Honesty
- Principal-Agent Problems
- Map and Territory
Tags (50 posts):
- GPT
- Bounties & Prizes (active)
- AI-assisted Alignment
- Moore's Law
- Compute
- Nanotechnology
- List of Links
- AI Safety Public Materials
- Computer Science
- Tripwire
- Quantum Mechanics
| Score | Title | Author | Posted | Comments |
|---|---|---|---|---|
| 13 | An Open Agency Architecture for Safe Transformative AI | davidad | 11h | 11 |
| 123 | How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme | Collin | 5d | 18 |
| 265 | A challenge for AGI organizations, and a challenge for readers | Rob Bensinger | 19d | 30 |
| 211 | The Plan - 2022 Update | johnswentworth | 19d | 33 |
| 26 | Paper: Transformers learn in-context by gradient descent | LawrenceC | 4d | 11 |
| 47 | Reframing inner alignment | davidad | 9d | 13 |
| 99 | Re-Examining LayerNorm | Eric Winsor | 19d | 8 |
| 22 | Extracting and Evaluating Causal Direction in LLMs' Activations | Fabien Roger | 6d | 2 |
| 31 | [ASoT] Natural abstractions and AlphaZero | Ulisse Mini | 10d | 1 |
| 57 | Multi-Component Learning and S-Curves | Adam Jermyn | 20d | 24 |
| 364 | DeepMind alignment team opinions on AGI ruin arguments | Vika | 4mo | 34 |
| 20 | My thoughts on OpenAI's Alignment plan | Donald Hobson | 10d | 0 |
| 72 | Engineering Monosemanticity in Toy Models | Adam Jermyn | 1mo | 6 |
| 338 | A Mechanistic Interpretability Analysis of Grokking | Neel Nanda | 4mo | 39 |
| 59 | Predicting GPU performance | Marius Hobbhahn | 6d | 24 |
| 93 | [Link] Why I'm optimistic about OpenAI's alignment approach | janleike | 15d | 13 |
| 26 | An exploration of GPT-2's embedding weights | Adam Scherlis | 7d | 2 |
| 60 | By Default, GPTs Think In Plain Sight | Fabien Roger | 1mo | 16 |
| 31 | [ASoT] Finetuning, RL, and GPT's world prior | Jozdien | 18d | 8 |
| 7 | Alignment with argument-networks and assessment-predictions | Tor Økland Barstad | 7d | 3 |
| 16 | Research request (alignment strategy): Deep dive on "making AI solve alignment for us" | JanBrauner | 19d | 3 |
| 13 | [LINK] - ChatGPT discussion | JanBrauner | 19d | 7 |
| 92 | Beliefs and Disagreements about Automating Alignment Research | Ian McKenzie | 3mo | 4 |
| 223 | New Scaling Laws for Large Language Models | 1a3orn | 8mo | 21 |
| 36 | Prizes for ML Safety Benchmark Ideas | joshc | 1mo | 3 |
| 151 | Godzilla Strategies | johnswentworth | 6mo | 65 |
| 68 | $20K In Bounties for AI Safety Public Materials | Dan H | 4mo | 7 |
| 72 | NeurIPS ML Safety Workshop 2022 | Dan H | 4mo | 2 |