Part A — Did the 10.0 selection pipeline work?
Context
MATS cohort 10.0 (summer 2026) was the first cohort to use a centralized application review — previously each stream reviewed its own applicants. The centralized process used an LLM-graded rubric at Stage 2 to produce a composite score that gated which ~600 applicants reached Stage 3 (stream-specific review).
This part is a validity check on that pipeline. The eight analyses look at: does the composite predict which applicants streams ultimately rank? Do other Stage-2 signals (CodeSignal, research-taste test, AI safety engagement) add value beyond the composite? Are the weights right? What about non-standard advancement pathways? Do different stream families weight things differently? And does the composite predict performance on an external work test designed by Anthropic?
How the 10.0 selection pipeline worked
~2,200 people applied. Each applicant went through three stages:
- Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
- Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
- Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
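As a worked illustration of the formula above, here is a minimal Python sketch. The weights and relevance multipliers come from the text; the `Scores` container, the field names, and the 0–10 score scale in the usage note are assumptions made for the example, not the pipeline's actual code.

```python
from dataclasses import dataclass

# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


@dataclass
class Scores:
    """Hypothetical container for one applicant's rubric scores."""
    rs: float        # Research Skills
    mle: float       # Technical Execution sub-score: MLE
    swe: float       # Technical Execution sub-score: SWE
    math: float      # Technical Execution sub-score: Math
    ss: float        # Soft Skills
    relevance: str   # "Direct", "Adjacent", or "Distant"


def technical_execution(s: Scores) -> float:
    # TE = 0.50*MLE + 0.30*SWE + 0.20*Math
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math


def composite(s: Scores) -> float:
    # The relevance multiplier scales Research Skills only.
    rs_adjusted = s.rs * RELEVANCE[s.relevance]
    # composite = 0.50*RS + 0.35*TE + 0.15*SS
    return 0.50 * rs_adjusted + 0.35 * technical_execution(s) + 0.15 * s.ss
```

For example, an applicant with RS 8, MLE 6, SWE 7, Math 5, SS 9 and "Adjacent" relevance gets TE = 6.1, adjusted RS = 6.8, and a composite of 0.50·6.8 + 0.35·6.1 + 0.15·9 = 6.885.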
Outcome definitions used throughout
is_ranked (primary outcome) — applicant was ranked by ≥1 stream. The cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offers are bounded by cohort size, ranks aren't.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked.
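The two outcome flags can be sketched as follows. This is an illustration of the definitions above, not the project's code: the record layout and the field names (`ranked_by`, `worktest_invites`, `interview_invites`, `sent_megastream_takehome`) are hypothetical.

```python
def is_ranked(applicant: dict) -> bool:
    # Primary outcome: ranked by at least one stream.
    return len(applicant.get("ranked_by", [])) >= 1


def is_invited_to_worktest(applicant: dict) -> bool:
    # Secondary outcome: any engagement by at least one stream.
    # Including is_ranked here is what makes this a superset of it.
    return (
        is_ranked(applicant)
        or len(applicant.get("worktest_invites", [])) >= 1
        or len(applicant.get("interview_invites", [])) >= 1
        or bool(applicant.get("sent_megastream_takehome", False))
    )
```

Because the secondary flag is built from the primary one, every ranked applicant also counts as engaged, whatever the other fields contain.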
Headline findings
- The composite predicts ranking, but most of the apparent validity is the Stage-2 gate doing the work. Whole-pool empirical AUC = 0.82 [0.79, 0.85]; Stage-3-restricted AUC = 0.68 [0.63, 0.72]. A1 results.
- No single signal adds dramatic value over the composite. Best individual additions (research-taste Part 1, Δ AUC +0.05) have CIs grazing zero. CodeSignal's univariate AUC is 0.61, so the 8.0 paradox replicates on the selection side (C1 will close the loop on performance). A2 results.
- Current 50/35/15 weighting is close to optimal. Logistic-derived empirical weights give AUC 0.71 [0.64, 0.78] vs current 0.67 [0.60, 0.74]; the intervals overlap substantially. The big surprise: the MLE attribute's empirical weight is at the floor (clipped from a negative coefficient); Math and SS coefficients are also smaller than the current TE split would suggest. A3 results.
- Special-advance pathways: CodeSignal specials substantially outperform the marginal regular advance. A4a: 5/14 (36%) of CodeSignal specials were ranked, vs 0/14 (0%) of the bottom 14 regularly advanced empirical Stage 3 applicants. Neel Group B trainees (Stage 2 fails, n=14) were ranked at 57%, mostly via Neel's separate process. Non-Nanda topped-ups (n=118) were ranked at 11%. A4 results.
- Research-taste test carries comparable signal to the composite on the same subsample. AUC 0.63 [0.57, 0.69]. Tier labels are monotone in ranked rate. Cheaters scored high but were rarely ranked. A5 results.
- Stream clusters genuinely weight attributes differently, so one-size-fits-all is a compromise. Empirical interp (incl. Steinhardt) values Math + RS·rel heavily; capability evals are roughly even but with a NEGATIVE soft-skills coefficient; control/oversight values SWE + soft skills; foundational/misc values MLE. Denominators are per-cluster (applicants who applied to ≥1 stream in the cluster). A6 results.
- LLM stream recs: high recall, low precision. "Reject" calls almost never become rankings (LLM rejections are almost always correct); "advance" calls act as a permissive screen (high recall, but only ~3% of advanced applicants end up ranked). Useful as a screen, not a final decision. A7 results.
- The composite does NOT predict performance on the Megastream takehome, but the sample is heavily selection-restricted. Composite: ρ = +0.046 [-0.174, +0.263]. CodeSignal is slightly better: ρ = +0.147. n = 87. A8 results.
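Several findings above report an AUC with a bracketed interval and contrast the whole pool with a Stage-3-restricted subsample. The sketch below shows one way such numbers could be computed. It assumes a pairwise (Mann-Whitney) AUC and a percentile bootstrap; the report does not say which interval method was actually used, and the function names are illustrative.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=500, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap interval (assumed method)."""
    point = auc(scores, labels)  # also validates that both classes are present
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # resample has a single class, so AUC is undefined
        stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


def restrict(scores, labels, keep):
    """Subset to rows where keep is True, e.g. applicants who reached Stage 3."""
    rows = [(s, y) for s, y, k in zip(scores, labels, keep) if k]
    return [s for s, _ in rows], [y for _, y in rows]
```

Comparing `bootstrap_auc_ci` on the full pool against the same call on `restrict(...)` output reproduces the shape of the whole-pool vs Stage-3-restricted comparison: once the low-composite applicants (who are almost never ranked) are removed, the composite has less variance left to discriminate on, so the restricted AUC is typically lower.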
11.0 implications (tentative)
- Keep the composite as a Stage-2 gate — it does most of the gate work effectively (A1 funnel: Q1–Q3 essentially never reach Stage 3, Q5 advancement rate ~100%).
- Reconsider the TE sub-weights. Empirical evidence (A3) suggests MLE is over-weighted relative to its predictive value on the ranking outcome. Consider whether MLE captures something we care about that isn't observable in stream rankings (e.g., it could matter for downstream performance even if not selection — Part C will look).
- Per-cluster scoring is worth piloting. A6 shows real divergence in what clusters value. Even an advisory per-cluster score next to the global composite would help streams calibrate.
- LLM stream rec prompt redesign. Use A7's per-stream precision/recall split to target which streams need more conservative prompts (low precision) vs. which need more inclusive prompts (high precision but missing rankings).
- Composite's null relationship with Megastream takehome (A8) is a yellow flag, consistent with the 8.0 paradox. Don't over-interpret on n=87, but weight this against C-series results before drawing process conclusions.
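The per-stream precision/recall split mentioned in the prompt-redesign item could be tabulated along these lines. This is a sketch under assumptions: the input is a hypothetical list of `(stream, llm_advance, ranked)` tuples, one per applicant-stream pair, and is not the actual A7 data layout.

```python
from collections import defaultdict


def per_stream_precision_recall(rows):
    """rows: iterable of (stream, llm_advance: bool, ranked: bool) tuples.

    Returns {stream: (precision, recall)}; a value is None when its
    denominator is zero (no "advance" calls, or no ranked applicants).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for stream, llm_advance, ranked in rows:
        c = counts[stream]
        if llm_advance and ranked:
            c["tp"] += 1   # LLM said advance, stream ranked them
        elif llm_advance and not ranked:
            c["fp"] += 1   # LLM said advance, stream did not rank them
        elif ranked:
            c["fn"] += 1   # LLM said reject, stream ranked them anyway
    result = {}
    for stream, c in counts.items():
        advanced = c["tp"] + c["fp"]
        ranked_total = c["tp"] + c["fn"]
        precision = c["tp"] / advanced if advanced else None
        recall = c["tp"] / ranked_total if ranked_total else None
        result[stream] = (precision, recall)
    return result
```

Streams with low precision would be candidates for more conservative prompts, and streams with low recall (rankings the LLM rejected) for more inclusive ones.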
Individual reports
Errors encountered during Part A
None unrecovered. Pre-flight surfaced and fixed a critical data.py bug: pandas StringDtype changes silently broke JSON parsing in _parse_json_columns, which left is_ranked and the CodeSignal-related columns empty in the anon CSV. This was fixed before Part A began. See preflight P4 contract tests (now 16/16 passing) for the regression gate.
Minor in-flight fixes: (1) AIS duration is free-text, not multiple-choice — added regex parser. (2) A4 "bottom of regularly-advanced" reference initially included topped-ups (which have engineered-low composites by design) — fixed to exclude both specials and topped-ups. (3) Stream rankings use [w] Display name, not Internal handle; mapping handled in _common.py helpers.