C4 — Has selection quality improved across cohorts?

Context

MATS has changed its selection process across cohorts: 7.0 and 8.0 were fully decentralized (each stream reviewed its own applicants), 9.0 added partial centralized review on top, and 10.0 went fully centralized. Did selection quality improve as a result?

This is a hard question because we can't observe quality directly — we observe proxies (mentor evals, SRP/FRP scores, post-program publications), each measured with different instruments across cohorts. We compare relative shape rather than raw levels, and flag caveats throughout.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 (summer 2026) is the first cohort to use fully centralized application review instead of stream-specific review; 9.0 only layered partial centralized review on top of the decentralized process. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. The process had three stages, each narrowing the pool:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
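
The weighting above can be sketched in a few lines. This is a minimal illustration, not the production scoring code: the function names are hypothetical, the inputs are assumed to share a common numeric scale, and the relevance multiplier is assumed to scale Research Skills before the 0.50 weight is applied (the text says only that it is "applied to Research Skills").

```python
# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs, mle, swe, math, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS for the empirical track.

    Assumption: the relevance multiplier scales RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant, scores invented for illustration only.
direct = empirical_composite(rs=8, mle=7, swe=6, math=5, ss=9, relevance="Direct")
adjacent = empirical_composite(rs=8, mle=7, swe=6, math=5, ss=9, relevance="Adjacent")
```

Under these assumptions the same applicant scores 7.555 with Direct relevance and 6.955 with Adjacent relevance, so the multiplier alone moves the composite by 0.6 points here.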

Outcome definitions

Outcome definitions used throughout these analyses:

Why this comparison is hard

Mentor-eval composite distributions

Cohort   n     Mean   Median   P25    P75    Frac ≥8/10
6.0      95    7.20   7.25     6.38   8.12   22%
7.0      76    7.34   7.38     6.50   8.00   33%
8.0      106   7.16   7.25     6.56   7.75   25%
9.0      93    7.52   7.50     6.75   8.25   40%

Publication rates by latest cohort

Latest cohort   n    P(has ≥1 pub)   Median n_pubs
5.0             26   54%             1
5.1             25   76%             1
6.0             42   62%             1
6.1             33   94%             3
7.0             27   70%             2
7.1             46   80%             2
8.0             21   48%             0
8.1             75   63%             1
9.0             87   14%             0

There is a heavy recency confound here: 6.0 alumni have had ~2 years to publish, 9.0 alumni only ~6 months. The low publication rates for the most recent cohorts (most visibly 9.0 at 14%) are therefore likely driven largely by elapsed time rather than by quality.

SRP/FRP

Raw cohort means: 7.0=78.3, 8.0=79.8, 9.0=2.8. Cross-cohort comparison of raw scores is not meaningful because the rubrics differ; the 9.0 mean is evidently on a different scale from the 7.0 and 8.0 means. Within-cohort percentile is what we use for cross-cohort analyses elsewhere (e.g., C2, C3 implicitly).
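
A within-cohort percentile can be computed as sketched below. This is an illustration under stated assumptions rather than the code used in C2/C3: the function name is hypothetical, ties are handled with a mid-rank convention (the source does not say which convention is used), and the toy scores are invented to mimic two rubrics on different scales.

```python
from bisect import bisect_left, bisect_right


def within_cohort_percentiles(scores):
    """Mid-rank percentile (0-100) of each score within its own cohort.

    A score's percentile is the share of cohort scores strictly below it,
    plus half the share tied with it (the score itself counts as a tie).
    """
    ordered = sorted(scores)
    n = len(ordered)
    result = []
    for s in scores:
        below = bisect_left(ordered, s)
        equal = bisect_right(ordered, s) - below
        result.append(100.0 * (below + 0.5 * equal) / n)
    return result


# Hypothetical toy data: one cohort scored on a ~0-100 rubric, one on a ~0-5 rubric.
toy = {"8.0": [70.0, 80.0, 90.0], "9.0": [2.0, 3.0, 4.0]}
pct = {cohort: within_cohort_percentiles(vals) for cohort, vals in toy.items()}
```

Because each score is ranked only against its own cohort, the two toy cohorts get identical percentile vectors even though their raw scales differ. The cost is that percentiles say nothing about whether one cohort is stronger overall than another.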

Takeaways

  1. Mentor-eval composite distributions are fairly stable across 6.0, 7.0, 8.0, and 9.0. Means sit between 7.16 and 7.52 and medians between 7.25 and 7.50; the fraction rated "high quality" (≥8) varies more, from 22% to 40%.
  2. No clear quality jump from 7.0 → 8.0 → 9.0. 9.0 has the highest mean (7.52) and the highest share at ≥8 (40%), but 8.0 sits below 7.0 on both measures, so the shift from decentralized to partially-centralized selection in 9.0 doesn't show up as an unambiguous quality bump in the data we have. But: cohort sizes also grew, so even maintaining the same distribution while scaling up is itself a kind of improvement.
  3. We can't yet evaluate 10.0's fully-centralized approach — the program just started. Part C5 attempts a partial answer by retrospectively applying 10.0's LLM tier sorting to 8.0/9.0 resumes.
  4. For 11.0: if the goal is to improve selection quality, we shouldn't expect dramatic shifts in the mentor-eval distribution from changing selection process alone. The quality ceiling is likely set by the applicant pool, not the rubric.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Mentor evals: all rows in each cohort's mentor_X table. SRP/FRP: one table per cohort. Publication rates: alumni_pubs, with each alum attributed to their latest cohort.

Outcome variable(s). Mentor: mean of four standardized dimensions (domain skill, research execution, AI safety knowledge, mission alignment). SRP/FRP: raw final_score. Publication: has_pub binary, n_pubs count.

Predictor fields. N/A — descriptive cross-cohort comparison.

Filters applied. None beyond per-table standard load.

Missing-data handling. No imputation; null rows excluded from per-cohort summaries.
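
The mentor-eval summary described above can be sketched as follows. This is a hedged illustration, not the analysis code: the dictionary keys are hypothetical stand-ins for the four dimensions, a row is assumed to count as "null" if any one dimension is missing, and quartiles use Python's "inclusive" interpolation, which may differ from the method behind the reported P25/P75.

```python
from statistics import mean, median, quantiles

# Hypothetical field names for the four mentor-eval dimensions.
DIMENSIONS = (
    "domain_skill",
    "research_execution",
    "ai_safety_knowledge",
    "mission_alignment",
)


def composite(row):
    """Mean of the four dimensions, or None if any dimension is missing."""
    values = [row.get(d) for d in DIMENSIONS]
    if any(v is None for v in values):
        return None  # assumption: a partially null row is excluded, not imputed
    return mean(values)


def summarize(rows, threshold=8.0):
    """Per-cohort summary; needs at least two non-null rows for quartiles."""
    scores = [c for c in map(composite, rows) if c is not None]
    p25, _, p75 = quantiles(scores, n=4, method="inclusive")
    return {
        "n": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "p25": p25,
        "p75": p75,
        "frac_ge_threshold": sum(s >= threshold for s in scores) / len(scores),
    }


# Invented rows: four complete evaluations plus one with a missing dimension.
rows = [{d: v for d in DIMENSIONS} for v in (6.0, 7.0, 8.0, 9.0)]
rows.append({"domain_skill": 9.0, "research_execution": None,
             "ai_safety_knowledge": 9.0, "mission_alignment": 9.0})
summary = summarize(rows)
```

With the invented rows, the incomplete evaluation is dropped, so `summary["n"]` is 4 rather than 5, and the reported n in a per-cohort table would count only complete evaluations under this convention.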

Key assumptions / caveats.