Part A — Did the 10.0 selection pipeline work?

Context

MATS cohort 10.0 (summer 2026) was the first cohort to use a centralized application review — previously each stream reviewed its own applicants. The centralized process used an LLM-graded rubric at Stage 2 to produce a composite score that gated which ~600 applicants reached Stage 3 (stream-specific review).

This part is a validity check on that pipeline. The eight analyses look at: does the composite predict which applicants streams ultimately rank? Do other Stage-2 signals (CodeSignal, research-taste test, AI safety engagement) add value beyond the composite? Are the weights right? What about non-standard advancement pathways? Do different stream families weight things differently? And does the composite predict performance on an external work test designed by Anthropic?

How the 10.0 selection pipeline worked

~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
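As a minimal sketch of the formula above: the weights and relevance multipliers are the ones quoted in the text, but the class and function names are hypothetical. It also assumes all sub-scores share one scale and that the multiplier is applied to Research Skills before the 0.50 weight. It is not the pipeline's actual code.

```python
from dataclasses import dataclass

# Relevance multipliers quoted in the text; they scale Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


@dataclass
class RubricScores:
    """Hypothetical container for one applicant's Stage-2 rubric scores."""
    rs: float          # Research Skills
    mle: float         # Technical Execution: ML engineering sub-score
    swe: float         # Technical Execution: software engineering sub-score
    math_score: float  # Technical Execution: math sub-score
    ss: float          # Soft Skills
    relevance: str = "Direct"


def technical_execution(s: RubricScores) -> float:
    # TE = 0.50*MLE + 0.30*SWE + 0.20*Math
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math_score


def composite(s: RubricScores) -> float:
    # Assumption: the multiplier is applied to RS before weighting.
    rs_adjusted = RELEVANCE[s.relevance] * s.rs
    return 0.50 * rs_adjusted + 0.35 * technical_execution(s) + 0.15 * s.ss
```

For example, an applicant with RS = 8, MLE = 6, SWE = 7, Math = 5, SS = 9 and an Adjacent match gets TE = 6.1 and an adjusted RS of 6.8, so the composite is 0.50·6.8 + 0.35·6.1 + 0.15·9 = 6.885.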

Outcome definitions used throughout

Headline findings

  1. The composite predicts ranking, but most of the apparent validity is the Stage-2 gate doing the work. Whole-pool empirical AUC = 0.82 [0.79, 0.85]; Stage-3-restricted AUC = 0.68 [0.63, 0.72]. A1 results.
  2. No single signal adds dramatic value over the composite. Best individual additions (research-taste Part 1, Δ AUC +0.05) have CIs grazing zero. CodeSignal univariate AUC = 0.61, so the 8.0 paradox replicates on the selection side (C1 will close the loop on performance). A2 results.
  3. Current 50/35/15 weighting is close to optimal. Logistic-derived empirical weights give AUC 0.71 [0.64, 0.78] vs current 0.67 [0.60, 0.74]. The big surprise: the MLE attribute's empirical weight is at the floor (clipped from a negative coefficient); Math and SS coefficients are also smaller than the current TE split would suggest. A3 results.
  4. Special-advance pathways: CodeSignal specials substantially outperform the marginal regular advance. A4a: 5/14 = 36% of CodeSignal specials were ranked, vs 0/14 = 0% of the bottom 14 regularly advanced empirical Stage 3 applicants. Neel Group B trainees (Stage 2 fails, n=14) were ranked at 57%, mostly via Neel's separate process. Non-Nanda topped-ups (n=118) were ranked at 11%. A4 results.
  5. Research-taste test carries comparable signal to the composite on the same subsample. AUC 0.63 [0.57, 0.69]. Tier labels are monotone in ranked rate. Cheaters scored high but were rarely ranked. A5 results.
  6. Stream clusters genuinely weight attributes differently: one-size-fits-all is a compromise. Empirical interp (incl. Steinhardt) values Math + RS·rel heavily; capability evals are roughly even but with a NEGATIVE soft-skills coefficient; control/oversight values SWE + soft skills; foundational/misc values MLE. Denominators are per-cluster (applicants who applied to ≥1 stream in the cluster). A6 results.
  7. LLM stream recs: high recall, low precision. "Reject" calls almost never become rankings, so LLM rejects are almost always correct; "advance" calls are a permissive screen (~3% precision for being ranked). Useful as a screen, not a final decision. A7 results.
  8. The composite does NOT predict performance on the Megastream takehome — but the sample is heavily selection-restricted. ρ = +0.046 [-0.174, +0.263]. CodeSignal slightly better: ρ = +0.147. n = 87. A8 results.
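Several findings above report an AUC with a bracketed interval. The sketch below shows one common way to produce such numbers: a rank-based AUC plus a percentile bootstrap. The interval method actually used in the individual reports is not stated here, so the percentile bootstrap, the function names, and the defaults are all assumptions.

```python
import random


def auc(scores, labels):
    """Probability that a random ranked applicant outscores a random
    unranked one; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the AUC (an assumed method)."""
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]  # resample applicants with replacement
        y = [labels[i] for i in sample]
        if 0 < sum(y) < len(y):  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in sample], y))
    stats.sort()
    lo = stats[max(0, int((alpha / 2) * n_boot))]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

For instance, `auc([1, 2, 3, 4], [0, 1, 0, 1])` returns 0.75. Running the same function once on the whole pool and once on only the Stage 3 rows is the kind of comparison finding 1 describes: restricting to applicants who already passed the composite gate narrows the score range, which typically lowers the AUC.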

11.0 implications (tentative)

Individual reports

| Analysis | Question | n |
|---|---|---|
| A1 — Composite predictive validity | Does composite predict ranking? | 1,683 empirical pool / 791 Stage 3 |
| A2 — Incremental validity | Which signals add value over composite? | 791 (varies by predictor) |
| A3 — Optimal composite weights | Are 50/35/15 weights right? | 791 |
| A4 — Special advances | Did non-standard pathways add value? | 14 (CodeSignal) / 23 (Neel) / 133 (topup) |
| A5 — Research taste test | Does the test predict ranking? | 394 takers |
| A6 — Per-stream-cluster profiles | Do clusters weight differently? | per-cluster pool, 142–699 applied |
| A7 — LLM stream recs accuracy | Do LLM advance/reject match actual ranking? | 9,143 (applicant × stream) recs |
| A8 — Megastream takehome validation | Does composite predict external takehome? | 87 takers |

Errors encountered during Part A

None unrecovered. Pre-flight surfaced and fixed a critical data.py bug: pandas StringDtype changes silently broke JSON parsing in _parse_json_columns, so is_ranked and the CodeSignal-related columns were all empty in the anon CSV. This was fixed before Part A began. See preflight P4 contract tests (now 16/16 passing) for the regression gate.
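The failure mode described above (a parse step that silently yields empty columns) is the kind of thing a small contract check can catch. The sketch below is illustrative only: the helper names are hypothetical, it is not the data.py fix or the P4 test suite, and it assumes missing cells show up as `None`, a float NaN, an empty string, or a value whose string form is `nan` or `<NA>`.

```python
import json
import math


def parse_json_cell(value):
    """Parse one raw cell. Returns None for a missing cell and raises on
    malformed JSON rather than quietly returning an empty value."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    # str() normalises string-like scalars; assumed missing markers are below.
    text = str(value).strip()
    if text == "" or text.lower() in {"nan", "<na>"}:
        return None
    return json.loads(text)  # json.JSONDecodeError surfaces bad input loudly


def check_not_all_empty(name, parsed):
    """Contract check: a parsed column must contain at least one value."""
    if all(v is None for v in parsed):
        raise AssertionError(f"column {name!r} parsed to all-empty values")
    return parsed
```

A check like `check_not_all_empty("is_ranked", [parse_json_cell(v) for v in raw_cells])`, where `raw_cells` is a placeholder for the raw column values, turns an all-empty column into an immediate failure before any analysis runs.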

Minor in-flight fixes: (1) AIS duration is free-text, not multiple-choice — added regex parser. (2) A4 "bottom of regularly-advanced" reference initially included topped-ups (which have engineered-low composites by design) — fixed to exclude both specials and topped-ups. (3) Stream rankings use [w] Display name, not Internal handle; mapping handled in _common.py helpers.