10.0 used two work tests at Stage 3: a research-taste test for the empirical track (the nanobot-scenario test, ~394 takers) and a policy/governance writing test (~82 takers). We've already looked at predictive validity (A5 / B5). This analysis is more granular: how do the score distributions actually look, are the sub-scores (Part 1 / Part 2) redundant, and is the overall policy score doing more than the grading-tier label?
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
Outcome definitions used throughout these analyses:
is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).| Ranked | n | Mean | Median | P25 | P75 |
|---|---|---|---|---|---|
| No | 291 | 50.4 | 50.4 | 44.7 | 55.2 |
| Yes | 103 | 54.2 | 54.1 | 49.9 | 60.1 |
Ranked applicants score visibly higher on average, with substantial overlap.
Spearman ρ between Part 1 and Part 2 scores = +0.64 (n = 394).
If ρ is high (>0.7), the two parts measure essentially the same thing — one of them could be dropped. If low-to-moderate (<0.5), they capture different aspects of research taste and both are worth keeping.
Each part's correlation with the empirical composite: Part 1 = +0.02, Part 2 = +0.05. Both modest.
| Ranked | n | Mean overall | Median overall |
|---|---|---|---|
| No | 66 | 3.2 | 3.5 |
| Yes | 16 | 3.2 | 3.5 |
The grading field for the policy/gov work test isn't a clean categorical tier — it's a generated rationale dict per response. So we look at the overall numeric score by ranked status instead. Spearman ρ between policy/gov work-test score and the empirical composite (descriptive only — most takers are policy/gov track) = -0.09.
Sample. Research-taste takers: n=394. Policy/gov work test takers: n=82.
Outcome variable(s). Score distributions (descriptive) + correlation with composite.
Predictor fields. Work-test scores.
Filters applied. Canonical dedup. Per-test sample = takers only.
Missing-data handling. Listwise drop on score columns.
Key assumptions / caveats.