MATS fellow selection — cross-part analysis

Comprehensive analysis of MATS fellow selection across cohorts 6.0–10.0, motivated by the design of Autumn 2026 (11.0) selection. Generated 2026-05-10.

What this is

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship. Cohort 10.0 (summer 2026) was the first cohort to use a centralized application review; previously each research stream reviewed its own applicants. This analysis evaluates how the 10.0 process worked and informs the design of 11.0 (autumn 2026). Findings draw on five cohorts of application data, mentor evaluations, SRP/FRP reviews, alumni publications, and the Q3 2025 alumni survey.

How the 10.0 selection pipeline worked

~2,200 people applied. Selection ran in three stages, with applicants filtered out at each:

  1. Stage 1 — applicants submitted background, picked tracks (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure), and took an LLM-graded screen. The LLM also produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track produced a composite score combining Research Skills (with a relevance multiplier), Technical Execution (MLE / SWE / Math), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific streams to apply to. Each stream reviewed and ranked its applicants. ~120 offers were made; ~63 additional applicants were waitlisted.
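
The Stage 2 composite and cutoff described above can be sketched roughly as follows. This is a minimal illustration, not the production scoring code. The weights, the 0-10 score scale, the field names, and the `top_n` helper are all assumptions made for the example; the only things taken from the process itself are that the composite combines Research Skills (scaled by a relevance multiplier), Technical Execution, and Soft Skills, and that the top applicants by composite advance.

```python
from dataclasses import dataclass

# Hypothetical weights: the writeup does not state how the components are combined.
WEIGHTS = {"research": 0.5, "technical": 0.3, "soft": 0.2}


@dataclass
class Applicant:
    name: str
    research_skills: float      # rubric score, assumed to be on a 0-10 scale
    relevance: float            # relevance multiplier, assumed to lie in [0, 1]
    technical_execution: float  # MLE / SWE / Math rubric score, assumed 0-10
    soft_skills: float          # rubric score, assumed 0-10


def composite(a: Applicant) -> float:
    """Weighted sum in which the relevance multiplier scales Research Skills only."""
    return (
        WEIGHTS["research"] * a.research_skills * a.relevance
        + WEIGHTS["technical"] * a.technical_execution
        + WEIGHTS["soft"] * a.soft_skills
    )


def top_n(pool: list[Applicant], n: int) -> list[Applicant]:
    """Return the n applicants with the highest composite (the Stage 3 cutoff)."""
    return sorted(pool, key=composite, reverse=True)[:n]


# Invented applicants, for illustration only.
pool = [
    Applicant("a", 8.0, 1.0, 6.0, 7.0),
    Applicant("b", 9.0, 0.5, 9.0, 5.0),
    Applicant("c", 4.0, 1.0, 5.0, 9.0),
]
advanced = top_n(pool, n=2)
print([(x.name, round(composite(x), 2)) for x in advanced])
```

In this toy pool, applicant "b" has the strongest raw Research Skills score but a low relevance multiplier, so it ranks below "a" on the composite. With a real pool the same call would use `n=600`.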

The 4 parts

Part                                  Question                                                   # analyses
A — 10.0 pipeline validation          Did the central 10.0 rubric work?                          8
B — 10.0 process & design questions   How should we design specific 11.0 components?             9
C — Cross-cohort validation           Do findings replicate across cohorts 6.0–10.0?             8
D — Convergent & exploratory          Factor structure, stream consistency, signal convergence   6

31 analyses in total. Each individual analysis has its own writeup with a context block, headline, charts, tables, and a debug-mode methodology callout.

The most robust findings

These are the conclusions I hold with the highest confidence: each is supported across cohorts, across measurement instruments, or by multiple independent analyses.

  1. The composite score works as a Stage-2 gate, especially below the 30th-percentile floor.

    Sources: A1 (whole-pool AUC 0.82, Stage-3 AUC 0.68); B6 (Stage-3 percentile curve is strongly concave with near-zero rank rates in bottom deciles).

  2. The CodeSignal paradox is real and replicates across 3 cohorts. CodeSignal predicts admission (AUC 0.70–0.78) but does NOT predict mentor evaluations of in-program performance (ρ ≈ 0).

    Sources: A2 (10.0 selection-side replication); C1 (cross-cohort replication); A8 (no signal for external Megastream takehome either).

  3. Application features explain at most about a third of mentor-eval variance, and in some cohorts far less. Selection from applications is fundamentally noisy. Don't over-optimize.

    Sources: C2 (R²: 7.0=0.08, 8.0=0.34, 9.0=0.26); A8 (composite ↔ takehome ρ ≈ 0 with severe range restriction).

  4. Different stream families weight applicant attributes differently. Empirical interpretability weights Math and Research Skills heavily; capability evals weight attributes roughly evenly, with a negative soft-skills coefficient; control/oversight weights SWE and soft skills.

    Sources: A6 (per-cluster regressions); D2 (per-stream consistency with composite varies widely).

  5. Mentor evaluations are essentially a single "overall quality" factor. PC1 explains 60–70% of variance; all 4 sub-dimensions load together (halo effect).

    Source: D5.

  6. Mentor-eval distributions are remarkably stable across cohorts. 6.0/7.0/8.0/9.0 all have a mean composite of 7.2–7.5/10 and a "high quality" share of ~25–35%. The shift from decentralized to partially centralized review in 9.0 didn't produce a visible quality jump.

    Source: C4.

  7. Returning applicants outperform first-timers — but mostly via clearing earlier gates. Conditional on reaching Stage 3, the gap narrows substantially.

    Source: B8.

  8. AI-safety-org references carry the strongest reference-type signal for selection (B4) and a modest one for mentor evals (D3), though the latter is mostly absorbed by other features.

    Sources: B4; D3.
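
Several of the findings above rest on two rank-based statistics: AUC (how well a score separates admitted from non-admitted applicants) and Spearman's ρ (how well a score tracks a later rating). The sketch below shows one standard way to compute both in plain Python, and how a score can have a high selection AUC while showing ρ ≈ 0 among the people it helped admit. The data is invented for illustration and the helper names are hypothetical; the underlying analyses may well have used different tooling.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up toy data, not MATS results.
test_score = [90, 80, 70, 60, 50, 40]
admitted = [1, 1, 0, 1, 0, 0]
selection_auc = auc(test_score, admitted)

# Among the admitted only: test score vs. a later mentor rating.
admitted_scores = [s for s, y in zip(test_score, admitted) if y == 1]
mentor_rating = [7, 8, 7]
in_program_rho = spearman(admitted_scores, mentor_rating)

print(f"selection AUC = {selection_auc:.2f}, in-program rho = {in_program_rho:.2f}")
```

Note that ρ here is computed only on admitted applicants, so any real comparison of this kind is subject to the range restriction mentioned under finding 3.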

The clearest 11.0 implications

Things I'd recommend acting on based on these findings:

Cautions / what NOT to conclude

Methodology notes

Known caveats and data issues (documented during the run)