MATS fellow selection — cross-part analysis

Comprehensive analysis of MATS fellow selection across cohorts 6.0–10.0, motivated by the design of Autumn 2026 (11.0) selection. Generated 2026-05-10.

What this is

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship. Cohort 10.0 (summer 2026) was the first cohort to use a fully centralized application review; in earlier cohorts research streams reviewed their own applicants, with 9.0 moving to partial centralized review. This analysis evaluates how the 10.0 process worked and informs the design of 11.0 (autumn 2026). Findings draw on five cohorts of application data, mentor evaluations, SRP/FRP reviews, alumni publications, and the Q3 2025 alumni survey.

How the 10.0 selection pipeline worked

~2,200 people applied. Each applicant went through three stages:

Stage 1 — applicants submitted background, picked tracks (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure), and took an LLM-graded screen. The LLM also produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track produced a composite score combining Research Skills (with relevance multiplier), Technical Execution (MLE / SWE / Math), and Soft Skills. Top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific streams to apply to. Each stream reviewed and ranked its applicants. ~120 offers were made; ~63 additional applicants were waitlisted.
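
The Stage-2 gate described above can be sketched as follows. This is an illustration only: the source says the composite combines the three components but does not say how, so the weighted-sum form, the weights, the score ranges, and every field name here are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights: the source does not state how the components are combined.
W_RESEARCH, W_TECHNICAL, W_SOFT = 0.5, 0.3, 0.2


@dataclass
class RubricScores:
    person_id: str
    research_skills: float      # rubric score, assumed 0-10
    relevance: float            # relevance multiplier, assumed in [0, 1]
    technical_execution: float  # MLE / SWE / Math, assumed 0-10
    soft_skills: float          # assumed 0-10


def composite(s: RubricScores) -> float:
    """Weighted sum; the relevance multiplier scales only Research Skills."""
    return (W_RESEARCH * s.research_skills * s.relevance
            + W_TECHNICAL * s.technical_execution
            + W_SOFT * s.soft_skills)


def advance_top_n(pool: list[RubricScores], n: int) -> list[str]:
    """Return the person_ids of the n highest composite scores."""
    ranked = sorted(pool, key=composite, reverse=True)
    return [s.person_id for s in ranked[:n]]
```

In 10.0 the cutoff was roughly the top 600 by composite, which would correspond to advance_top_n(pool, 600) in this sketch.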

The 4 parts

Part	Question	# analyses
A — 10.0 pipeline validation	Did the central 10.0 rubric work?	8
B — 10.0 process & design questions	How should we design specific 11.0 components?	9
C — Cross-cohort validation	Do findings replicate across cohorts 6.0–10.0?	8
D — Convergent & exploratory	Factor structure, stream consistency, signal convergence	6

31 analyses in total. Each individual analysis has its own writeup with a context block, headline, charts, tables, and a debug-mode methodology callout.

The most robust findings

These are the conclusions I hold with the highest confidence: each is supported across cohorts, across measurement instruments, or by multiple independent analyses.

The composite score works as a Stage-2 gate, especially below the 30th-percentile floor.
Sources: A1 (whole-pool AUC 0.82, Stage-3 AUC 0.68); B6 (Stage-3 percentile curve is strongly concave with near-zero rank rates in bottom deciles).
The CodeSignal paradox is real and replicates across 3 cohorts. CodeSignal predicts admission (AUC 0.70–0.78) but does NOT predict mentor evaluations of in-program performance (ρ ≈ 0).
Sources: A2 (10.0 selection-side replication); C1 (cross-cohort replication); A8 (no signal for external Megastream takehome either).
Application features explain at most about a third of mentor-eval variance (R² ranges from 0.08 to 0.34 across cohorts). Selection from applications is fundamentally noisy. Don't over-optimize.
Sources: C2 (R²: 7.0=0.08, 8.0=0.34, 9.0=0.26); A8 (composite ↔ takehome ρ ≈ 0 with severe range restriction).
Different stream families weight applicant attributes differently. Empirical interpretability values Math + RS heavily; capability evals weight attributes roughly evenly, with a negative soft-skills coefficient; control/oversight values SWE + soft skills.
Sources: A6 (per-cluster regressions); D2 (per-stream consistency with composite varies widely).
Mentor evaluations are essentially a single "overall quality" factor. PC1 explains 60–70% of variance; all 4 sub-dimensions load together (halo effect).
Source: D5.
Mentor-eval distributions are remarkably stable across cohorts. 6.0/7.0/8.0/9.0 all have mean composite 7.2–7.5/10 and "high quality" share ~25–35%. The shift from decentralized to partial centralized review in 9.0 didn't produce a visible quality jump.
Source: C4.
Returning applicants outperform first-timers — but mostly via clearing earlier gates. Conditional on reaching Stage 3, the gap narrows substantially.
Source: B8.
AI-safety-org references carry the strongest reference-type signal for selection (B4) and a modest signal for mentor evals (D3), though the latter is mostly absorbed by other features.
Sources: B4; D3.
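
The two statistics behind several of these findings (AUC for selection outcomes, Spearman ρ for mentor evaluations) can be computed as in the minimal sketch below. It is a stdlib-only illustration, not the code used in the analyses; the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with ties given the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic; labels are 0/1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

The CodeSignal pattern corresponds to auc(codesignal, is_ranked) sitting well above 0.5 while spearman(codesignal, mentor_eval) stays near 0 among admitted fellows. The second number is computed only on people who were admitted, so range restriction can push it toward zero on its own.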

The clearest 11.0 implications

Things I'd recommend acting on based on these findings:

Drop or de-emphasize CodeSignal in selection. Three cohorts of evidence say it predicts admission but not performance (C1). 10.0 was already moving this direction; 11.0 should commit.
Treat the composite as a hard floor at roughly the 30th–40th percentile of Stage 3, rather than as merely informational. B6's percentile curve shows the bottom deciles essentially never get ranked.
Make Stage-1 stream selection optional / low-friction. Applicants and streams find each other late (B2); early stream selection adds more friction than value for most applicants.
Surface per-stream-cluster signals to reviewers as advisory inputs. The composite is a compromise across clusters that weight things differently (A6); cluster-specific advisory scores would help streams calibrate.
Keep the "publication record" review prompt. Most consistent signal across cohorts (C2/C3).
Surface the AI-safety-org reference flag to Stage-3 reviewers (B4). Strongest single reference-type signal; cheap to highlight.
Continue (but don't blindly trust) the ToC alignment score. It's a cheap, effective screen (B3). The flat AIS-engagement multi-select is probably not pulling its weight — consider replacing with depth-capturing prompts.
Make AI-use policy more explicit. The Pangram fraction_ai signal is being implicitly penalized (B7); an explicit policy would deter use and improve fairness.
Simplify the research-taste test from two parts to one, given the Part 1 / Part 2 redundancy (D4).
Track returning applicants by prior cohort + prior outcome. Disentangling self-selection from genuine improvement requires this (B8).

Cautions / what NOT to conclude

Don't expect dramatic quality improvements from changing selection process alone. 6.0–9.0 mentor-eval distributions look essentially the same (C4); the quality ceiling is set more by the applicant pool than by the rubric.
Don't trust the C5 result that "10.0 features didn't beat 9.0 centralized review." The comparison only had access to tier counts, not the full 10.0 attribute aggregation. A proper retrospective comparison would require re-running the full 10.0 rubric on 8.0/9.0 resumes.
Don't over-interpret single-cohort findings, especially small-sample analyses (A4 Neel, A8 takehome, D3 references). Cross-cohort consistency is the strongest evidence.
Don't read the alumni survey as unbiased. It's opt-in and skews toward engaged/successful alumni (C8). Use as upper-bound triangulation.
Don't assume FRP/SRP is a clean proxy for mentor signal. D6 shows weak and decreasing correlation across cohorts.

Methodology notes

Canonical analysis sample for 10.0: 2,203 rows after dedup (one row per person_id, kept furthest stage + max composite tie-break).
Primary outcome: is_ranked = ranked by ≥1 stream. Secondary: is_invited_to_worktest = engaged by ≥1 stream (broader funnel level).
Stream cluster assignments (used in A6, D2) restricted to streams with Empirical in Stage-1 application group; dropouts (Garriga-Alonso, Emmons, Nasr) and Nanda excluded.
Statistical stance: Effect sizes + bootstrap 95% CIs (2,000 reps). No global multiple-comparisons correction; cross-cohort replication is the primary evidence.
Anonymization: all analyses run on source_data_anon/. PII columns dropped; person_id resolves identity via the identity graph.
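
The statistical stance above (effect sizes with bootstrap 95% CIs, 2,000 reps) can be illustrated with a minimal sketch. It assumes the percentile method, which the source does not specify; the function name and defaults are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: point estimate plus (lower, upper) CI bounds."""
    rng = random.Random(seed)  # fixed seed so reruns give identical intervals
    n = len(values)
    # Resample n values with replacement, reps times, and sort the statistics.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(reps)
    )
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return stat(values), lower, upper
```

For a two-variable statistic such as a correlation, the same idea applies, but rows (pairs) must be resampled together rather than each column independently.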

Known caveats and data issues (documented during the run)

10.0 AIS engagement form bug: secondary "research program" / "structured course" detail panels were swapped. Main multi-select is unaffected; B3 and other AIS-related analyses use only unaffected fields.
Streams that dropped out before / during 10.0: Garriga-Alonso, Emmons, Nasr. Excluded from cluster analyses (A6, D2).
Neel Nanda's selection process runs in parallel to the main MATS process, so his stream is excluded from per-stream analyses but kept in pool-level analyses.
Within-cohort duplicate person_ids exist (apps_6: 14, apps_7: 3, apps_8: 24, apps_9: 6, apps_10: 7). Deduped via furthest-stage + max-composite tie-break.
10.0 has no mentor-eval data yet because the program has only just started. Mentor-eval predictive analyses (C2, C5, D3, D5) cover 6.0/7.0/8.0/9.0 only.
Pandas StringDtype bug in data.py was caught and fixed during preflight; otherwise is_ranked and CodeSignal-related columns would have been silently empty. 16/16 contract tests now gate against regressions.
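
The dedup rule used for within-cohort duplicate person_ids (furthest stage, then max composite as tie-break) can be sketched as below. This is a stdlib-only illustration, not the code in data.py; the row layout and the field names stage and composite are assumptions, as is treating a missing composite as lowest.

```python
def dedup_applications(rows):
    """Keep one row per person_id: furthest stage, then highest composite."""
    best = {}
    for row in rows:
        # Assumption: a missing composite sorts below any real score.
        score = row["composite"] if row["composite"] is not None else float("-inf")
        key = (row["stage"], score)
        current = best.get(row["person_id"])
        if current is None or key > current[0]:
            best[row["person_id"]] = (key, row)
    # Dicts preserve insertion order, so output follows first appearance.
    return [row for _, row in best.values()]
```

Comparing (stage, composite) tuples makes the priority explicit: composite only matters when two rows for the same person reached the same stage.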