Tags (620 posts): AI, Eliciting Latent Knowledge (ELK), AI Robustness, Truthful AI, Autonomy and Choice, Intelligence Explosion, Social Media, Transcripts

Tags (114 posts): Infra-Bayesianism, Counterfactuals, Interviews, Audio, Logic & Mathematics, AXRP, Redwood Research, Domain Theory, Counterfactual Mugging, Newcomb's Problem, Formal Proof, Functional Decision Theory
Posts (karma · title · author · age · comments):

- 79 · Towards Hodge-podge Alignment · Cleo Nardo · 1d · 20
- 39 · The "Minimal Latents" Approach to Natural Abstractions · johnswentworth · 22h · 6
- 251 · AI alignment is distinct from its near-term applications · paulfchristiano · 7d · 5
- 12 · Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. · Charlie Steiner · 19h · 0
- 53 · Can we efficiently explain model behaviors? · paulfchristiano · 4d · 0
- 85 · Trying to disambiguate different questions about whether RLHF is "good" · Buck · 6d · 39
- 182 · Using GPT-Eliezer against ChatGPT Jailbreaking · Stuart_Armstrong · 14d · 77
- 49 · Existential AI Safety is NOT separate from near-term applications · scasper · 7d · 15
- 29 · High-level hopes for AI alignment · HoldenKarnofsky · 5d · 3
- 129 · Mechanistic anomaly detection and ELK · paulfchristiano · 25d · 17
- 76 · Finding gliders in the game of life · paulfchristiano · 19d · 7
- 23 · Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. · Charlie Steiner · 7d · 3
- 503 · (My understanding of) What Everyone in Technical Alignment is Doing and Why · Thomas Larsen · 3mo · 83
- 48 · Verification Is Not Easier Than Generation In General · johnswentworth · 14d · 23
- 154 · Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · LawrenceC · 17d · 9
- 150 · Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley · maxnadeau · 1mo · 14
- 99 · Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small · KevinRoWang · 1mo · 5
- 25 · Causal scrubbing: results on a paren balance checker · LawrenceC · 17d · 0
- 136 · Takeaways from our robust injury classifier project [Redwood Research] · dmz · 3mo · 9
- 17 · Causal scrubbing: Appendix · LawrenceC · 17d · 0
- 29 · Vanessa Kosoy's PreDCA, distilled · Martín Soto · 1mo · 17
- 174 · High-stakes alignment via adversarial training [Redwood Research report] · dmz · 7mo · 29
- 35 · A conversation about Katja's counterarguments to AI risk · Matthew Barnett · 2mo · 9
- 57 · Infra-Exercises, Part 1 · Diffractor · 3mo · 9
- 30 · Ethan Perez on the Inverse Scaling Prize, Language Feedback and Red Teaming · Michaël Trazzi · 3mo · 0
- 98 · Infra-Bayesian physicalism: a formal theory of naturalized induction · Vanessa Kosoy · 1y · 20
- 116 · Redwood Research's current project · Buck · 1y · 29
- 95 · Why I'm excited about Redwood Research's current project · paulfchristiano · 1y · 6