The empirical pipeline gets most of the attention, but ~445 applicants to cohort 10.0 applied through Policy & Strategy (P&S) and ~250 through Technical Governance (TG). These tracks use a different Stage-2 rubric: Research Skills is paired with Analytical Communication (instead of Technical Execution), and Research Skills gets two relevance multipliers (policy relevance and technical relevance) rather than one. Does the policy/gov rubric work?
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
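As a minimal sketch, the empirical composite described above could be computed as follows. The weights and multipliers come from the text. The function names, the numeric rating scale, and the choice to scale the raw Research Skills rating before weighting are assumptions made for illustration, not the pipeline's actual code.

```python
# Relevance multipliers as stated in the text.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs: float, mle: float, swe: float, math: float,
                        ss: float, relevance: str) -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier scales the raw RS rating
    before the 0.50 weight is applied.
    """
    adjusted_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_execution(mle, swe, math)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant rated on an assumed 1-5 scale.
score = empirical_composite(rs=4, mle=3, swe=4, math=2, ss=5, relevance="Adjacent")
```

Under these assumptions, moving the same applicant from Adjacent to Distant lowers only the Research Skills term, so the multiplier can cost at most 0.50 × 0.40 × RS composite points.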
Outcome definitions used throughout these analyses:
- is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
- is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
- passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

P&S composite → is_ranked (P&S applicants only, n=445): AUC = 0.706 [0.584, 0.808]. TG composite → is_ranked (TG applicants only, n=373): AUC = 0.752 [0.645, 0.851].
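Both composites discriminate above chance: the lower bound of each 95% CI is above 0.5. Compared with the empirical composite (A1 full-pool AUC ≈ 0.82), the policy/gov composites have somewhat lower point estimates (0.706 and 0.752), but their intervals are wide, with upper bounds of 0.808 and 0.851, so performance is similar in magnitude.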
| Sample | Predictor | n | n_pos | AUC | 95% CI |
|---|---|---|---|---|---|
| P&S applicants | ps_composite | 445 | 30 | 0.706 | [0.584, 0.808] |
| TG applicants | tg_composite | 373 | 25 | 0.752 | [0.645, 0.851] |
| Combined P&S+TG | ps_composite | 584 | 39 | 0.724 | [0.633, 0.810] |
| Policy review sample (n≈424) | Research Skills (P/G rating) | 423 | 32 | 0.790 | [0.730, 0.845] |
| Policy review sample (n≈424) | Analytical Communication | 423 | 32 | 0.664 | [0.573, 0.742] |
| Policy review sample (n≈424) | Soft Skills (P/G) | 423 | 32 | 0.607 | [0.517, 0.693] |
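
The AUC values and intervals in the table can be illustrated with a small sketch. The source does not say how the 95% CIs were computed; the percentile bootstrap below is an assumption, and `auc` and `bootstrap_auc_ci` are hypothetical helper names. The AUC itself is the standard rank-based definition: the probability that a randomly chosen ranked applicant scores higher than a randomly chosen unranked one, with ties counted as one half.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: P(score_pos > score_neg), ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (assumed method, not from the source)."""
    auc(scores, labels)  # fail fast if the sample has only one class
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # resample has a single class, so AUC is undefined
        stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy data: four applicants, two of them ranked.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

With only 25 to 39 positives per row, resampled AUCs vary a lot from draw to draw, which is one way to see why the intervals in the table span 0.15 to 0.25.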
Policy relevance (P&S applicants):

- Adjacent: 238
- Directly relevant: 142
- Distant: 43

Technical relevance (TG applicants):

- Adjacent: 183
- Distant: 124
- Directly relevant: 116

Most applicants are tagged Direct or Adjacent on at least one relevance axis. Few are tagged Distant on both: the relevance system is working as designed.
Among 82 policy/gov applicants who completed the work test, work-test score → is_ranked AUC = 0.533 [0.378, 0.685]. With so few takers the CI is wide and includes 0.5, so this sample cannot distinguish the work-test score from chance. The grading tier (Exceeds/Meets/Near/Below) tracks scores monotonically, so it is a usable categorical summary.
In the policy review sample, 91 applicants have a low Research Skills tier (≤2). Of these, 0 were ranked, so the rescue rate (low-RS → ranked) is 0.0%.
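
The rescue-rate calculation can be sketched as below. The field names `rs_tier` and `is_ranked` on each record are assumptions for illustration. The second helper is not part of the original analysis: it gives the exact one-sided upper confidence bound for a proportion when zero events are observed, which is one hedged way to read "0 of 91".

```python
def rescue_rate(rows, max_low_tier=2):
    """Return (n_low, n_rescued, rate) for applicants with RS tier <= max_low_tier."""
    low = [r for r in rows if r["rs_tier"] is not None and r["rs_tier"] <= max_low_tier]
    if not low:
        raise ValueError("no low-RS applicants in the sample")
    rescued = sum(1 for r in low if r["is_ranked"])
    return len(low), rescued, rescued / len(low)


def zero_event_upper_bound(n, conf=0.95):
    """Upper bound on the true rate when 0 of n events are observed.

    Solves (1 - p)**n = 1 - conf for p (exact binomial, one-sided).
    """
    return 1.0 - (1.0 - conf) ** (1.0 / n)


# Hypothetical records, not real applicant data.
example_rows = [
    {"rs_tier": 1, "is_ranked": False},
    {"rs_tier": 2, "is_ranked": False},
    {"rs_tier": 4, "is_ranked": True},
]
n_low, n_rescued, rate = rescue_rate(example_rows)
bound_91 = zero_event_upper_bound(91)
```

For n = 91 the bound is about 3.2%, so an observed rescue rate of 0.0% is compatible with a true rate of up to a few percent.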
Sample. P&S track applicants: n=445. TG track applicants: n=373. Combined: n=584. Policy/gov-review sample (anyone with a non-null P/G RS score): n=423.
Outcome variable(s). is_ranked (any stream).
Predictor fields. [stage-2-policy-gov-review] Policy & Strategy composite score, [stage-3-policy-gov] Technical Governance composite score, and the underlying tier ratings (RS, A&C, SS). Work-test overall score for the n=82 takers.
Filters applied. Canonical dedup. Track masks from [stage-1-track] Selected tracks.
Missing-data handling. Per-predictor listwise drop.
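
A minimal sketch of per-predictor listwise deletion, assuming each applicant is a dict with possibly missing (None) values and hypothetical field names: each predictor is evaluated on the rows where that predictor and the outcome are both present, so n can differ from one predictor to the next.

```python
def complete_cases(rows, predictor, outcome="is_ranked"):
    """Keep (predictor, outcome) pairs where neither value is missing."""
    return [
        (r[predictor], r[outcome])
        for r in rows
        if r.get(predictor) is not None and r.get(outcome) is not None
    ]


# Hypothetical records, not real applicant data.
rows = [
    {"ps_composite": 3.2, "tg_composite": None, "is_ranked": True},
    {"ps_composite": None, "tg_composite": 2.9, "is_ranked": False},
    {"ps_composite": 2.1, "tg_composite": 3.4, "is_ranked": False},
]

# Each predictor gets its own analysis sample.
n_by_predictor = {
    name: len(complete_cases(rows, name))
    for name in ("ps_composite", "tg_composite")
}
```

Dropping per predictor rather than across all predictors at once keeps each sample as large as possible, at the cost that AUCs for different predictors are not computed on identical applicants.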
Key assumptions / caveats.