The 10.0 pipeline added a step where an LLM reviews each applicant's resume and sorts their experience into three tiers — Familiar, Applied, Expert — per topic / skill. Sanyu re-ran this 10.0-style tier-sorting on 8.0 and 9.0 applicant materials. This analysis asks whether the 10.0 feature engineering would have predicted in-program performance in those older cohorts.
Specifically: do the 10.0-style tier counts, used as predictors, explain as much variance (R²) in in-program performance as 9.0's actual centralized-review scores, or more? If the 10.0 approach is at least as good, that's evidence the new pipeline improves on what we had before.
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
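The composite arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the production scoring code: the function names are hypothetical, the input scores in the usage lines are made up, and it assumes the relevance multiplier scales RS before the 0.50 weight is applied (the text says only that the multiplier is "applied to Research Skills").

```python
# Weights and multipliers are taken from the formula in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS for the empirical track.

    Assumption: the relevance multiplier scales RS before weighting.
    """
    rs_adjusted = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant; the score scale here is illustrative only.
direct = empirical_composite(rs=4, mle=4, swe=3, math_score=5, ss=2)
adjacent = empirical_composite(rs=4, mle=4, swe=3, math_score=5, ss=2,
                               relevance="Adjacent")
```

Under this reading, moving from Direct to Adjacent lowers the composite by 0.50 · 0.15 · RS, so the penalty grows with the applicant's Research Skills score.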
Outcome definitions used throughout these analyses:
is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
The LLM reads an applicant's resume / experience text and labels each skill or topic as either: (a) Familiar: the applicant has seen it but no practical use evidence, (b) Applied: concrete use evidence in a project or job, (c) Expert: substantial experience with deep evidence (publications, leadership). We count how many items land in each tier and use the three counts as features.
This is a simplified summary; the actual 10.0 rubric is richer and feeds into Stage-2 attribute scores. But the tier-counts capture most of the signal in a comparable way across cohorts.
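As a rough illustration of how the tier labels become features, the sketch below counts labels per tier for one applicant. The input shape (a mapping from skill name to tier label), the skill names, and the feature names `n_familiar` / `n_applied` / `n_expert` are assumptions made for the example; the actual tier-sorter output format is not described here.

```python
from collections import Counter

TIERS = ("Familiar", "Applied", "Expert")


def tier_counts(labels):
    """Turn {skill: tier} labels into three count features.

    Raises ValueError on any label outside the three known tiers,
    so a malformed LLM output is not silently dropped.
    """
    counts = Counter(labels.values())
    unknown = set(counts) - set(TIERS)
    if unknown:
        raise ValueError(f"unexpected tier labels: {sorted(unknown)}")
    return {f"n_{tier.lower()}": counts.get(tier, 0) for tier in TIERS}


# Hypothetical labels for a single applicant.
features = tier_counts({
    "PyTorch": "Applied",
    "Reinforcement learning": "Familiar",
    "Interpretability": "Expert",
    "CUDA": "Applied",
})
```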
| Cohort | n (shared) | R² (10.0 features) | R² (9.0 centralized review) | R² (combined) |
|---|---|---|---|---|
| 8.0 | 86 | 0.047 | 0.161 | 0.201 |
| 9.0 | 59 | 0.064 | 0.183 | 0.212 |
Caveat: small samples and in-sample R²s. This is a hypothesis-generating comparison, not a definitive one.
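One standard way to gauge how much of an in-sample R² could be due to the number of predictors is the adjusted R², 1 − (1 − R²)(n − 1)/(n − p − 1). The helper below is a sketch of that formula only. Any predictor count p passed to it is an assumption: the predictor fields suggest 3 tier counts and 4 review dimensions (7 combined), but the exact model specification is not stated.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² for a model with p predictors (excluding the intercept)
    fitted on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Example call using the 8.0 row of the table, with the ASSUMED p = 3
# (one predictor per tier count).
adj_tiers_8 = adjusted_r2(0.047, n=86, p=3)
```

Adjusted R² penalizes the combined model more than either single-source model because it carries more predictors, which matters at these sample sizes.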
Sample. 8.0 and 9.0 fellows with all of (a) full 10.0-style tier-sort outputs, (b) 9.0-style centralized review scores, and (c) a mentor-eval row. Final n: 86 (8.0) and 59 (9.0).
Outcome variable(s). Mean of standardized mentor-eval dimensions per fellow.
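A minimal sketch of this outcome construction follows. It assumes each dimension is z-scored across the fellows in the sample using the sample standard deviation, and that every fellow has a score on every dimension; the dimension names and fellow IDs are placeholders, and the actual standardization choices (within cohort or pooled, which standard deviation) are not specified in this write-up.

```python
from statistics import mean, stdev


def standardized_mean_outcome(evals):
    """evals: {fellow_id: {dimension: score}} -> {fellow_id: outcome}.

    Each dimension is z-scored across fellows, then the z-scores are
    averaged within each fellow.
    """
    dims = sorted({d for row in evals.values() for d in row})
    stats = {}
    for d in dims:
        values = [row[d] for row in evals.values()]
        stats[d] = (mean(values), stdev(values))
    return {
        fellow: mean((row[d] - stats[d][0]) / stats[d][1] for d in dims)
        for fellow, row in evals.items()
    }


# Placeholder data: two hypothetical mentor-eval dimensions on different scales.
outcome = standardized_mean_outcome({
    "fellow_a": {"dim_1": 1.0, "dim_2": 10.0},
    "fellow_b": {"dim_1": 2.0, "dim_2": 20.0},
    "fellow_c": {"dim_1": 3.0, "dim_2": 30.0},
})
```

Standardizing before averaging keeps a dimension with a wider raw scale from dominating the mean.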
Predictor fields. 10.0-style: counts of Familiar / Applied / Expert items from the re-run LLM tier-sorter. 9.0 centralized review: the original scores for Research independence, Publication record, Technical execution, and AI safety motivation.
Filters applied. Listwise complete on the joint feature set for fair R² comparison.
Missing-data handling. None — listwise-complete sample only.
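The comparison described above (filter to listwise-complete rows on the joint feature set, then fit each predictor set on the same rows) can be sketched with the standard library alone. This is an illustration under assumptions, not the code that produced the table: the column names are hypothetical, the model is assumed to be ordinary least squares with an intercept, and the solver assumes the predictors are not perfectly collinear.

```python
def listwise_complete(rows, columns):
    """Keep only rows with a non-missing value in every listed column."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def ols_r2(X, y):
    """In-sample R² of an OLS fit with intercept, via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for i in range(k):
            if i != c:
                factor = A[i][c] / A[c][c]
                A[i] = [a - factor * b for a, b in zip(A[i], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    y_hat = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def compare_r2(rows, tier_cols, review_cols, outcome_col):
    """Fit tier-only, review-only, and combined models on one shared sample."""
    all_cols = tier_cols + review_cols
    sample = listwise_complete(rows, all_cols + [outcome_col])
    y = [r[outcome_col] for r in sample]

    def fit(cols):
        return ols_r2([[r[c] for c in cols] for r in sample], y)

    return {"n": len(sample), "tiers": fit(tier_cols),
            "review": fit(review_cols), "combined": fit(all_cols)}
```

Filtering once on the joint column set, rather than separately per model, is what makes the three R² values comparable: every model sees exactly the same fellows. On a shared sample the combined in-sample R² can never fall below either single-source R², so the combined column alone does not show that both sources add signal.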
Key assumptions / caveats.