C5 — Would 10.0 features have improved on 9.0's review approach?

Context

The 10.0 pipeline added a step where an LLM reviews each applicant's resume and sorts their experience into three tiers — Familiar, Applied, Expert — per topic / skill. Sanyu re-ran this 10.0-style tier-sorting on 8.0 and 9.0 applicant materials. This analysis asks whether the 10.0 feature engineering would have predicted in-program performance in those older cohorts.

Specifically: do the 10.0-style tier counts, used as predictors, yield an R² at least as high as 9.0's actual centralized-review scores? If so, that is evidence the new pipeline improves on what we had before.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
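The composite formula above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names, the 0-10 score scale, and the example applicant are hypothetical, and the sketch assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied.

```python
# Weights and multipliers are the ones stated in the text.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with RS scaled by relevance."""
    # Assumption: the multiplier is applied to RS before weighting.
    adjusted_rs = RELEVANCE_MULTIPLIER[relevance] * rs
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant on an assumed 0-10 scale.
direct = empirical_composite(8, 7, 6, 5, 9, relevance="Direct")
adjacent = empirical_composite(8, 7, 6, 5, 9, relevance="Adjacent")
print(round(direct, 3), round(adjacent, 3))
```

With identical sub-scores, moving from Direct to Adjacent changes only the Research Skills term, so the composite drops by 0.50 × 0.15 × RS.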

Outcome definitions

Outcome definitions used throughout these analyses:

What the 10.0 tier-sort does

The LLM reads an applicant's resume / experience text and assigns each skill or topic to one of three tiers: (a) Familiar, meaning the applicant has encountered it but shows no evidence of practical use; (b) Applied, meaning there is concrete evidence of use in a project or job; (c) Expert, meaning substantial experience with deep evidence (publications, leadership). We count how many items land in each tier and use the three counts as features.
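The counting step described above might look like the following sketch. It assumes the LLM's output has already been parsed into a skill-to-tier mapping; that data shape, the function name, and the example skills are hypothetical.

```python
from collections import Counter

TIERS = ("Familiar", "Applied", "Expert")


def tier_count_features(labels):
    """Turn {skill: tier} labels into [n_familiar, n_applied, n_expert]."""
    counts = Counter(labels.values())
    unknown = set(counts) - set(TIERS)
    if unknown:
        raise ValueError(f"unexpected tier labels: {sorted(unknown)}")
    # Counter returns 0 for tiers that never appear.
    return [counts[tier] for tier in TIERS]


# Hypothetical LLM output for one applicant (skill names are made up).
example_labels = {
    "PyTorch": "Applied",
    "interpretability": "Expert",
    "reinforcement learning": "Familiar",
    "distributed training": "Applied",
}
features = tier_count_features(example_labels)
print(dict(zip(TIERS, features)))
```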

This is a simplified summary; the actual 10.0 rubric is richer and feeds into Stage-2 attribute scores. But the tier-counts capture most of the signal in a comparable way across cohorts.

R² comparison

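| Cohort | n (shared) | R² (10.0 features) | R² (9.0 centralized review) | R² (combined) |
|--------|------------|--------------------|-----------------------------|---------------|
| 8.0    | 86         | 0.047              | 0.161                       | 0.201         |
| 9.0    | 59         | 0.064              | 0.183                       | 0.212         |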
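For readers who want to see the shape of this comparison, here is a minimal sketch of fitting three OLS models (tier counts only, review scores only, both) on the same rows and reporting in-sample R². It assumes ordinary least squares with an intercept, which the report does not state explicitly. The data are synthetic and all names are hypothetical; the numbers it prints have no relation to the table above.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]  # fails if features are collinear
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def in_sample_r2(features, outcome):
    """In-sample R^2 of an OLS fit with intercept (normal equations)."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, outcome)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    mean_y = sum(outcome) / len(outcome)
    ss_res = sum((y - f) ** 2 for y, f in zip(outcome, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in outcome)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: 3 tier counts and 4 review scores per fellow.
rng = random.Random(0)
n = 86
tier_counts = [[rng.randint(0, 8) for _ in range(3)] for _ in range(n)]
review_scores = [[rng.uniform(1, 5) for _ in range(4)] for _ in range(n)]
outcome = [0.1 * t[2] + 0.4 * r[0] + rng.gauss(0, 1)
           for t, r in zip(tier_counts, review_scores)]
combined = [t + r for t, r in zip(tier_counts, review_scores)]

r2_tiers = in_sample_r2(tier_counts, outcome)
r2_review = in_sample_r2(review_scores, outcome)
r2_combined = in_sample_r2(combined, outcome)
print(f"tiers={r2_tiers:.3f} review={r2_review:.3f} combined={r2_combined:.3f}")
```

Because the combined model nests the other two, its in-sample R² can never be lower than either one. A higher combined R² therefore does not by itself show that the tier counts add out-of-sample signal.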

Per-feature bivariate ρ with mentor composite

Takeaways

  1. 9.0's centralized review wins on its own. In both 8.0 and 9.0 samples, the four 9.0-style centralized-review scores (research independence, publication record, technical execution, AI safety motivation) predict mentor evals substantially better than the raw 10.0 tier counts alone (R² 0.16 to 0.18 vs 0.05 to 0.06).
  2. But this comparison is unfair to 10.0. The 10.0 features I have for 8.0/9.0 are just the raw tier counts. The full 10.0 rubric also produces continuous attribute scores (Research Skills, MLE, SWE, Math, Soft Skills) that aggregate the tier evidence in more detail. Those weren't re-run on 8.0/9.0 resumes, so we can't compare like-for-like.
  3. What this DOES show: the simplest version of 10.0's automated tier-sorting is not by itself as predictive as 9.0's human-graded review. To beat 9.0's centralized review with 10.0's automated approach, the full attribute-score aggregation matters — not just the tier counts.
  4. For 11.0: keep the LLM tier sort AND the attribute-score aggregation on top. Don't try to substitute tier counts alone for the richer rubric. A useful follow-up: run the full 10.0 rubric on 8.0/9.0 resumes (not just the tier sort) and re-do this comparison.

Caveat: small samples and in-sample R²s. This is a hypothesis-generating comparison, not a definitive one.
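One quick way to gauge how much the in-sample caveat matters is the standard adjusted R² correction, 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below applies it to the reported values. It assumes each model is OLS with an intercept and uses 3 predictors for the tier counts, 4 for the review scores, and 7 for the combined model. Those predictor counts are inferred from the feature descriptions in this report, not confirmed by it.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Values copied from the R² comparison table.
REPORTED = {
    "8.0": {"n": 86, "r2": {"tiers": 0.047, "review": 0.161, "combined": 0.201}},
    "9.0": {"n": 59, "r2": {"tiers": 0.064, "review": 0.183, "combined": 0.212}},
}
# Assumed predictor counts: 3 tier counts, 4 review scores, 7 combined.
N_PREDICTORS = {"tiers": 3, "review": 4, "combined": 7}

adjusted = {
    cohort: {
        model: adjusted_r2(r2, info["n"], N_PREDICTORS[model])
        for model, r2 in info["r2"].items()
    }
    for cohort, info in REPORTED.items()
}

for cohort, values in adjusted.items():
    print(cohort, {model: round(v, 3) for model, v in values.items()})
```

Under these assumptions the tier-count R² shrinks to roughly 0.01 in both cohorts while the review-score R² stays near 0.12, so the ordering in takeaway 1 survives the correction. Adjusted R² is only a rough penalty for model size; it does not replace out-of-sample validation.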

🔧 Debug: how the data was interpreted (safe to skip)

Sample. 8.0 and 9.0 fellows with all three of (a) full 10.0-style tier-sort outputs, (b) 9.0-style centralized review scores, and (c) a mentor-eval row. Final n: 86 (8.0), 59 (9.0).

Outcome variable(s). Mean of standardized mentor-eval dimensions per fellow.
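A minimal sketch of how such a composite could be built: z-score each mentor-eval dimension across fellows, then average the z-scores per fellow. The dimension names and fellow IDs are hypothetical. The use of the sample standard deviation, and of standardizing within a single cohort, are assumptions the report does not confirm.

```python
from statistics import mean, stdev


def mentor_composite(evals):
    """Map fellow -> mean of z-scored dimensions.

    evals maps fellow -> {dimension: raw score}; every fellow is assumed
    to have a score on every dimension.
    """
    dimensions = sorted(next(iter(evals.values())))
    z_scores = {fellow: [] for fellow in evals}
    for dim in dimensions:
        values = [scores[dim] for scores in evals.values()]
        mu, sigma = mean(values), stdev(values)  # assumption: sample SD
        for fellow, scores in evals.items():
            z_scores[fellow].append((scores[dim] - mu) / sigma)
    return {fellow: mean(zs) for fellow, zs in z_scores.items()}


# Hypothetical mentor evals on two made-up dimensions.
evals = {
    "fellow_a": {"research_quality": 4, "productivity": 5},
    "fellow_b": {"research_quality": 2, "productivity": 3},
    "fellow_c": {"research_quality": 3, "productivity": 4},
}
composite = mentor_composite(evals)
print({fellow: round(score, 3) for fellow, score in composite.items()})
```

Standardizing first keeps a dimension with a wider raw range from dominating the average.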

Predictor fields. 10.0-style: counts of Familiar / Applied / Expert items from the re-run LLM tier-sorter. 9.0 centralized review: original Research independence, Publication record, Technical execution, AI safety motivation.

Filters applied. Listwise complete on the joint feature set for fair R² comparison.

Missing-data handling. None — listwise-complete sample only.
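The sample definition and listwise filter above amount to a three-way inner join followed by dropping any row with a missing value. A sketch under assumed data shapes (dicts keyed by a hypothetical fellow ID, with `None` marking a missing value):

```python
def listwise_complete(tier_counts, review_scores, mentor_evals):
    """Keep fellows present in all three sources with no missing values."""
    shared = set(tier_counts) & set(review_scores) & set(mentor_evals)
    kept = {}
    for fellow in sorted(shared):
        row = (list(tier_counts[fellow])
               + list(review_scores[fellow])
               + [mentor_evals[fellow]])
        if all(value is not None for value in row):
            kept[fellow] = row
    return kept


# Hypothetical inputs: 3 tier counts, 4 review scores, 1 outcome per fellow.
tier_counts = {"f1": [2, 3, 1], "f2": [0, 4, 2], "f3": [1, 1, 0]}
review_scores = {"f1": [3, 4, 4, 5], "f2": [2, None, 3, 4], "f4": [5, 5, 4, 4]}
mentor_evals = {"f1": 0.4, "f2": -0.2, "f3": 0.1}

sample = listwise_complete(tier_counts, review_scores, mentor_evals)
print(sorted(sample))
```

Here `f2` is dropped for a missing review score and `f3` and `f4` are dropped because each is absent from one source, so all three models are fit on exactly the same rows.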

Key assumptions / caveats.