C4 — Quality trend across cohorts

Context

MATS has changed its selection process across cohorts: 7.0 and 8.0 were fully decentralized (each stream reviewed its own applicants), 9.0 added partial centralized review on top, and 10.0 went fully centralized. Did selection quality improve as a result?

This is a hard question because we can't observe quality directly. We observe proxies (mentor evals, SRP/FRP scores, post-program publications), each measured with different instruments across cohorts. We therefore compare the relative shape of each cohort's distribution rather than raw levels, and flag caveats throughout.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 (summer 2026) is the first cohort with fully centralized application review instead of decentralized, stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
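The formula above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the class and field names are hypothetical, the sub-scores are assumed to share one scale, and the relevance multiplier is assumed to scale Research Skills before the 0.50 weight is applied.

```python
from dataclasses import dataclass

# Relevance multipliers as stated in the text; applied to Research Skills only.
RELEVANCE = {"direct": 1.0, "adjacent": 0.85, "distant": 0.60}


@dataclass
class EmpiricalScores:
    # Hypothetical container; field names are illustrative, not from the pipeline.
    research_skills: float
    mle: float
    swe: float
    math: float
    soft_skills: float
    relevance: str = "direct"


def technical_execution(s: EmpiricalScores) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math


def composite(s: EmpiricalScores) -> float:
    """Composite = 0.50*RS' + 0.35*TE + 0.15*SS, where RS' = multiplier * RS.

    Assumption: the multiplier scales RS before weighting.
    """
    rs = RELEVANCE[s.relevance] * s.research_skills
    return 0.50 * rs + 0.35 * technical_execution(s) + 0.15 * s.soft_skills


# Made-up applicant, for illustration only.
example = EmpiricalScores(research_skills=8, mle=7, swe=6, math=5,
                          soft_skills=9, relevance="adjacent")
print(round(composite(example), 3))
```

For this made-up applicant, TE = 6.3 and the adjusted Research Skills score is 6.8, so the composite comes to roughly 6.96. With a Direct match the same applicant would score about 7.56, which shows how much the multiplier can move someone near the Stage 3 cutoff.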

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
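As a rough sketch of how these three flags relate, the definitions above could be derived from per-applicant engagement fields like this. The record type and its field names are hypothetical; only the logic follows the definitions in the text.

```python
from dataclasses import dataclass


@dataclass
class StreamEngagement:
    # Hypothetical per-applicant record, aggregated over all streams.
    # Field names are illustrative, not taken from the real data.
    invited_to_work_test: bool = False
    invited_to_interview: bool = False
    ranked_by_any_stream: bool = False
    sent_megastream_takehome: bool = False
    offered: bool = False
    waitlisted: bool = False


def is_ranked(a: StreamEngagement) -> bool:
    """Primary outcome: ranked by at least one stream."""
    return a.ranked_by_any_stream


def is_invited_to_worktest(a: StreamEngagement) -> bool:
    """Secondary outcome: engaged by at least one stream in any way."""
    return (a.invited_to_work_test or a.invited_to_interview
            or a.ranked_by_any_stream or a.sent_megastream_takehome)


def passed_mentors_bar(a: StreamEngagement) -> bool:
    """Offered or waitlisted."""
    return a.offered or a.waitlisted


# Made-up applicant: interviewed but never ranked.
interviewed_only = StreamEngagement(invited_to_interview=True)
print(is_invited_to_worktest(interviewed_only), is_ranked(interviewed_only))
```

Because ranking is one of the conditions inside is_invited_to_worktest, every ranked applicant is also counted as engaged, which is what makes the secondary outcome a superset of the primary one.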

Why this comparison is hard

Different mentor-eval instruments. 6.0 is end-of-program; 7.0–9.0 are mid-program. The 9.0 instrument is richer (added reference-class rank, hire/pub likelihood). We use the four-dimension composite that's common across cohorts.
SRP/FRP rubric changes. 7.0: ToC + Activities. 8.0: ToC + Progress + Activities. 9.0/FRP: research success, AI risk reduction, researcher ability, symposium. Raw scores are NOT comparable; we report within-cohort percentiles where needed.
Cohort size scales. 7.0 had ~75 fellows; 10.0 has ~120. Admitting more fellows could dilute average quality by reaching further down the ranked pool, or not, if the applicant pool grew proportionally.
Publication recency. 6.0 alumni have had ~2 years to publish; 9.0 alumni have had ~6 months. Cross-cohort publication-rate comparisons partly reflect time-to-publish, not just program quality.
10.0 isn't here yet. The program just started; we have no in-program data on 10.0 fellows.

Mentor-eval composite distributions

Cohort	n	Mean	Median	P25	P75	Frac ≥8/10
6.0	95	7.20	7.25	6.38	8.12	22%
7.0	76	7.34	7.38	6.50	8.00	33%
8.0	106	7.16	7.25	6.56	7.75	25%
9.0	93	7.52	7.50	6.75	8.25	40%

Publication rates by latest cohort

Latest cohort	n	P(has ≥1 pub)	Median n_pubs
5.0	26	54%	1
5.1	25	76%	1
6.0	42	62%	1
6.1	33	94%	3
7.0	27	70%	2
7.1	46	80%	2
8.0	21	48%	0
8.1	75	63%	1
9.0	87	14%	0

Heavy recency confound here: 6.0 alumni have had ~2 years to publish, 9.0 alumni only ~6 months. The apparent decline in publication rate (down to 14% for 9.0) is therefore likely driven mostly by time-to-publish rather than by quality.
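For reference, the columns in the mentor-eval and publication tables can be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, the composite is taken to be on a 0 to 10 scale, and the quartiles use the "inclusive" interpolation from Python's statistics module, which may differ slightly from whatever convention produced the reported P25/P75.

```python
from statistics import mean, median, quantiles


def eval_summary(scores, threshold=8.0):
    """Per-cohort summary of mentor-eval composites (one score per fellow)."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "n": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "p25": q1,
        "p75": q3,
        "frac_at_or_above": sum(s >= threshold for s in scores) / len(scores),
    }


def pub_summary(pub_counts):
    """Per-cohort publication summary (one publication count per alum)."""
    return {
        "n": len(pub_counts),
        "p_has_pub": sum(c >= 1 for c in pub_counts) / len(pub_counts),
        "median_n_pubs": median(pub_counts),
    }


# Made-up numbers, for illustration only.
print(eval_summary([6.0, 7.0, 7.5, 8.0, 9.0]))
print(pub_summary([0, 1, 2]))
```

Neither summary corrects for time since the program ended, so pub_summary inherits the recency confound described above.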

SRP/FRP

Raw cohort means: 7.0 = 78.3, 8.0 = 79.8, 9.0 = 2.8. Comparing raw scores across cohorts is not meaningful because the rubrics differ (9.0 is evidently scored on a different scale). Within-cohort percentile is what we use for cross-cohort analyses elsewhere (e.g., C2, C3 implicitly).
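A within-cohort percentile can be sketched as follows. This is one possible convention (mid-rank, so tied scores share the same percentile), not necessarily the one used in C2 or C3; the function name and the record layout are hypothetical.

```python
from collections import defaultdict


def within_cohort_percentiles(records):
    """Map (cohort, fellow_id) -> percentile of raw_score within that cohort.

    records: iterable of (cohort, fellow_id, raw_score) tuples.
    Assumption: mid-rank convention, so ties share the average position.
    """
    by_cohort = defaultdict(list)
    for cohort, fellow_id, score in records:
        by_cohort[cohort].append((fellow_id, score))

    out = {}
    for cohort, rows in by_cohort.items():
        scores = [s for _, s in rows]
        n = len(scores)
        for fellow_id, score in rows:
            below = sum(s < score for s in scores)
            equal = sum(s == score for s in scores)
            out[(cohort, fellow_id)] = 100.0 * (below + 0.5 * equal) / n
    return out


# Made-up scores on two different rubrics, for illustration only.
records = [
    ("7.0", "a", 70.0), ("7.0", "b", 80.0), ("7.0", "c", 90.0),
    ("9.0", "x", 2.0), ("9.0", "y", 3.0), ("9.0", "z", 4.0),
]
pct = within_cohort_percentiles(records)
print(pct[("7.0", "b")], pct[("9.0", "y")])
```

In this toy data the middle fellow in each cohort lands at the same percentile even though the raw scales differ by more than an order of magnitude. The cost is that percentiles erase any real difference in level between cohorts, so they support comparisons of relative standing only.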
