C3 — Predicting publications

Context

The most downstream outcome we can measure for fellows is post-program publications. The alumni-publication tracker covers cohorts 5.0–9.0 (584 entries) with paper counts and citation counts per fellow. We join the tracker to application data and ask: which application features predict whether a fellow goes on to publish?

Note: this outcome is far downstream of selection. Many factors between admission to MATS and publishing (mentor fit, project choice, post-MATS opportunities) affect it, so we expect low R² and modest AUC. Publication is also a narrow outcome that does not capture every kind of impact.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief: ~2,200 people applied, and applicants progressed through up to three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
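
The composite formula above can be sketched as a small function. The weights and multiplier values come from the text; the 0–10 score scale, the function names, and the reading that the multiplier scales Research Skills before the 0.50 weight is applied are assumptions made for illustration, not the pipeline's actual code.

```python
# Weights and multipliers are copied from the description above.
# The 0-10 input scale and all function names are illustrative assumptions.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS' + 0.35*TE + 0.15*SS, where RS' = multiplier * RS."""
    rs_adjusted = RELEVANCE[relevance] * rs  # assumed: multiplier applied to RS only
    te = technical_execution(mle, swe, math_score)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant, scored once as "Adjacent" and once as "Direct".
adjacent = empirical_composite(8.0, 7.0, 6.0, 5.0, 9.0, relevance="Adjacent")
direct = empirical_composite(8.0, 7.0, 6.0, 5.0, 9.0, relevance="Direct")
print(f"Adjacent: {adjacent:.3f}  Direct: {direct:.3f}")
```

For this hypothetical applicant the composite is about 6.955 under "Adjacent" and 7.555 under "Direct", so under this reading the relevance label alone moves the score by 0.6 points on the assumed scale.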

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
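
The three definitions above can be read as set-membership checks over an applicant's per-stream events. The sketch below uses hypothetical event labels (the real schema is not shown in this write-up) and also checks the stated relationship that is_ranked implies is_invited_to_worktest.

```python
# Hypothetical event labels; the actual data schema is an assumption.
ENGAGEMENT_EVENTS = {"work_test_invite", "interview_invite", "ranked", "megastream_takehome"}
PASSED_BAR_EVENTS = {"offer", "waitlist"}


def outcome_flags(events):
    """Derive the outcome flags for one applicant.

    events: iterable of (stream_id, event_type) pairs.
    """
    types = {event_type for _, event_type in events}
    return {
        "is_ranked": "ranked" in types,
        "is_invited_to_worktest": bool(types & ENGAGEMENT_EVENTS),
        "passed_mentors_bar": bool(types & PASSED_BAR_EVENTS),
    }


# Toy applicants (invented for illustration only).
applicants = {
    "a1": [("s1", "work_test_invite"), ("s1", "ranked"), ("s1", "offer")],
    "a2": [("s2", "interview_invite")],
    "a3": [],
}
flags = {pid: outcome_flags(ev) for pid, ev in applicants.items()}

# Superset property: anyone who is ranked also counts as engaged.
assert all(f["is_invited_to_worktest"] for f in flags.values() if f["is_ranked"])
```

Because "ranked" is itself one of the engagement events, the superset relationship holds by construction in this sketch. Whether it is a strict superset depends on the data (applicant a2 above is engaged but not ranked).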

Caveats

Multi-cohort fellows are attributed to their latest cohort (per project memory), which avoids double-counting.
Recency bias: 9.0 alumni have had less time to publish than 6.0 alumni. Raw publication-rate comparisons across cohorts conflate program effects with time since the program.
Email joins are messy (see CLAUDE.md caveat #9). Anonymized records are keyed on person_id, and some unmatched records are expected.
Cohorts 5.0/5.1/6.1/7.1/8.1 have alumni-publication data but no application data, so they are excluded from the predictive model (they still appear in the per-cohort summary table).
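
The first and third caveats can be illustrated with a minimal sketch: keep one tracker row per person_id (the latest cohort), then join to application data and set aside whatever does not match. The field names other than person_id, and both helper functions, are hypothetical. Cohort labels are compared as integer tuples because a plain string comparison would sort "10.0" before "9.0".

```python
def cohort_key(label):
    """Turn a cohort label such as "7.1" into a sortable tuple (7, 1)."""
    major, minor = label.split(".")
    return int(major), int(minor)


def attribute_to_latest_cohort(rows):
    """Keep one row per person_id: the row from that person's latest cohort."""
    latest = {}
    for row in rows:
        pid = row["person_id"]
        if pid not in latest or cohort_key(row["cohort"]) > cohort_key(latest[pid]["cohort"]):
            latest[pid] = row
    return list(latest.values())


def join_to_applications(alumni_rows, apps_by_person):
    """Inner-join on person_id, returning (matched, unmatched) lists."""
    matched, unmatched = [], []
    for row in alumni_rows:
        app = apps_by_person.get(row["person_id"])
        if app is None:
            unmatched.append(row)
        else:
            matched.append({**app, **row})
    return matched, unmatched


# Toy data (invented): p1 appears in two cohorts, p2 has no application record.
tracker = [
    {"person_id": "p1", "cohort": "6.0", "n_pubs": 1},
    {"person_id": "p1", "cohort": "7.0", "n_pubs": 2},
    {"person_id": "p2", "cohort": "8.1", "n_pubs": 0},
]
apps = {"p1": {"person_id": "p1", "composite": 7.2}}

deduped = attribute_to_latest_cohort(tracker)
matched, unmatched = join_to_applications(deduped, apps)
```

Reporting the size of the unmatched list alongside the model results makes it easier to see how much of the sample the join loses.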

Per-cohort publication rates

Latest cohort	n alumni-pub rows	with ≥1 pub	median n_pubs	mean citations
5.0	26	14	1	41.7
5.1	25	19	1	45.9
6.0	42	26	1	33.0
6.1	33	31	3	37.8
7.0	27	19	2	10.3
7.1	46	37	2	18.9
8.0	21	10	0	0.9
8.1	75	47	1	8.3
9.0	87	12	0	2.2
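
To make the recency caveat concrete, the share of fellows with at least one publication can be computed directly from the first three columns of the table above. The variable names are illustrative; the counts are copied from the table.

```python
# (latest cohort, n alumni-pub rows, with >=1 pub), copied from the summary table.
summary = [
    ("5.0", 26, 14), ("5.1", 25, 19), ("6.0", 42, 26),
    ("6.1", 33, 31), ("7.0", 27, 19), ("7.1", 46, 37),
    ("8.0", 21, 10), ("8.1", 75, 47), ("9.0", 87, 12),
]

# Fraction of each cohort's rows with at least one publication.
pub_share = {cohort: with_pub / n for cohort, n, with_pub in summary}

for cohort, share in pub_share.items():
    print(f"{cohort}: {share:.1%}")

lowest = min(pub_share, key=pub_share.get)
highest = max(pub_share, key=pub_share.get)
```

The lowest raw share belongs to 9.0, the most recent cohort, and the highest to 6.1. Given the recency bias noted in the caveats, that gap should not be read as a difference in program effect.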

Logistic-regression predictive performance

Cohort	n	with pub	AUC	95% CI
6.0	42	26	0.724	[0.544, 0.883]
7.0	26	18	0.757	[0.542, 0.944]
8.0	21	10	0.777	[0.564, 0.969]
9.0	87	13	0.753	[0.593, 0.882]
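
The intervals above are wide, which is expected with 21 to 87 fellows per cohort. This write-up does not say how the intervals were computed. The sketch below shows one common approach, a percentile bootstrap over fellows, as an assumption rather than the method actually used. The function names and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels must be 0/1; ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC."""
    auc(labels, scores)  # fail early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy labels (1 = published) and model scores, invented for illustration.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6, 0.3, 0.15]
point = auc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_auc_ci(toy_labels, toy_scores, n_boot=500)
```

With small samples the bootstrap distribution is coarse, so intervals of this kind are best treated as rough indications of uncertainty.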
