Streams differ in what kind of research they do. An interpretability stream and a capability-evals stream probably value different things in an applicant. Today MATS uses one global composite score across all empirical streams. This analysis asks: do different families of streams actually weight applicant attributes differently when ranking applicants? If so, a single global composite is a compromise; per-cluster scoring (or at least per-cluster advisory signals) might help.
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
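As a worked illustration of the formula above, here is a minimal sketch in Python. The weights and multipliers are copied from the text; the function names, the 1-5 score scale in the example, and the choice to apply the multiplier to RS before weighting are assumptions made for illustration, not the production scoring code.

```python
# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, relevance, mle, swe, math_score, ss):
    """Composite = 0.50*(RS * relevance) + 0.35*TE + 0.15*SS.

    Assumption: the multiplier scales RS before the 0.50 weight is applied.
    """
    rs_adjusted = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant on an assumed 1-5 scale.
example = empirical_composite(
    rs=4, relevance="Adjacent", mle=3, swe=4, math_score=2, ss=5
)
print(round(example, 3))
```

Because the multiplier only touches RS, an "Adjacent" applicant loses at most 0.50 × 0.15 × RS composite points relative to a "Direct" applicant with the same raw scores.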
Outcome definitions used throughout these analyses:
- is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
- is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way (invited to a work test, invited to an interview, ranked, or sent the Megastream takehome). Strict superset of is_ranked. One level above is_ranked in the funnel.
- passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

Restricted to streams whose Stage 1 application group includes Empirical (so we can use the empirical attribute tiers as predictors). Streams that dropped out by the time of this analysis (Garriga-Alonso, Emmons, Nasr) and Neel Nanda (separate selection process) are excluded.
* Shard appears in both A and C — its work spans interpretability and oversight/control. An applicant who applied to Shard contributes to both cluster pools.
Borderline assignments to be aware of:
- Righetti (Biorisk + Security + Safeguards): placed in B because biorisk reads as capability-evals-adjacent.
- Dvijotham (Capability Evals + Security + Adversarial Robustness): placed in B.
- Parikh (Capability Evals + Control + Monitoring): placed in B by first-listed category.
For each cluster I fit a logistic regression: "did the applicant get ranked by ≥1 stream in this cluster?" predicted from the five empirical attribute tiers (RS·relevance, MLE, SWE, Math, Soft Skills). The sample is applicants who actually applied to ≥1 stream in that cluster at Stage 3, not the global Stage-3 empirical pool. So since 604 people applied to ≥1 cluster-A stream, 604 is the cluster-A denominator for the ranked rate; the regression itself uses the subset of those applicants with complete attribute scores (the "n (reg)" column).
The coefficients say how much each attribute pulls toward being ranked by streams in this cluster. Because the predictors are standardized, each coefficient is the change in the log-odds of being ranked per one-standard-deviation increase in that attribute, holding the other four fixed. A positive coefficient means "more of this attribute → more likely to be ranked." Negative means "more of this attribute → less likely" (often this is a multicollinearity artifact, but sometimes a real signal, so flag the big-magnitude negatives).
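To make the log-odds scale more concrete, a coefficient can be turned into an odds ratio per standard deviation with `exp`. The sketch below is a generic property of logistic regression, not part of the original analysis code; the helper names and the 10% starting probability are hypothetical and chosen only for illustration.

```python
import math


def odds_ratio_per_sd(coef):
    """Multiplicative change in the odds of being ranked per +1 SD."""
    return math.exp(coef)


def shifted_probability(start_prob, coef, sd_shift=1.0):
    """Probability after moving sd_shift SDs on one attribute.

    start_prob is a hypothetical starting probability, not a fitted
    intercept from the analysis.
    """
    logit = math.log(start_prob / (1.0 - start_prob)) + coef * sd_shift
    return 1.0 / (1.0 + math.exp(-logit))


# Coefficients taken from the results table: cluster A Math (+1.16)
# and cluster B SS (-0.43).
print(round(odds_ratio_per_sd(1.16), 2))   # roughly 3.2x the odds per SD
print(round(odds_ratio_per_sd(-0.43), 2))  # roughly 0.65x the odds per SD
print(round(shifted_probability(0.10, 1.16), 3))
```

Odds ratios are easier to communicate than raw coefficients, but they inherit the same caveat: the standard deviation is computed within each cluster's regression sample, so "one SD" is not the same amount of raw score in every cluster.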
| Cluster | # streams | Applied to cluster | Ranked by cluster | Rate | n (reg) | AUC | RS·rel | MLE | SWE | Math | SS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A — Empirical interpretability | 8 | 604 | 41 | 6.8% | 325 | 0.839 | +0.81 | +0.23 | +0.09 | +1.16 | +0.21 |
| B — Dangerous capability evals | 11 | 699 | 60 | 8.6% | 335 | 0.667 | +0.10 | +0.31 | +0.14 | -0.11 | -0.43 |
| C — AI control & oversight | 13 | 627 | 59 | 9.4% | 325 | 0.750 | +0.25 | +0.23 | +0.49 | +0.15 | +0.46 |
| D — misc | 1 | 178 | 4 | 2.2% | 73 | — | — | — | — | — | — |
| E — Security | 2 | 142 | 4 | 2.8% | 81 | — | — | — | — | — | — |
Read each row separately. The pattern of which attribute pulls toward "ranked by this cluster" tells you what that cluster cares about.
Sample. Per cluster the analysis sample is applicants who applied to ≥1 stream in that cluster at Stage 3 (from Stage 3 streams actually applied to). NOT track-restricted — a security or policy-track applicant who applied to a cluster's stream is included. Regression sample = cluster pool ∩ complete empirical attribute scores. Streams restricted to those whose Stage 1 application group contains 'Empirical' (34 streams). Dropouts (Garriga-Alonso, Emmons, Nasr) and Nanda excluded.
Outcome variable(s). Cluster-level binary: ranked by ≥1 stream in cluster X. streams_ranked_by entries matched against the cluster's display-name set.
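The cluster-level outcome described above can be sketched as a set-membership check. In this sketch only the Shard, Righetti, Dvijotham, and Parikh placements come from the text; the other stream names, the `CLUSTER_STREAMS` mapping, and the function names are hypothetical placeholders, and the real display-name sets are larger.

```python
# Partial, illustrative mapping of cluster -> stream display names.
# "Hypothetical ..." entries are placeholders, not real streams.
CLUSTER_STREAMS = {
    "A": {"Shard", "Hypothetical Interp Stream"},
    "B": {"Righetti", "Dvijotham", "Parikh"},
    "C": {"Shard", "Hypothetical Control Stream"},
}


def in_cluster_pool(streams_applied_to, cluster):
    """True if the applicant applied to >=1 stream in the cluster."""
    return any(s in CLUSTER_STREAMS[cluster] for s in streams_applied_to)


def ranked_by_cluster(streams_ranked_by, cluster):
    """1 if ranked by >=1 stream in the cluster, else 0."""
    return int(any(s in CLUSTER_STREAMS[cluster] for s in streams_ranked_by))


# An applicant ranked only by Shard counts as ranked in both A and C.
ranked_by = ["Shard"]
labels = {c: ranked_by_cluster(ranked_by, c) for c in CLUSTER_STREAMS}
print(labels)
```

Because Shard sits in two clusters, the per-cluster "ranked" counts are not mutually exclusive and should not be summed into a single total.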
Predictor fields. Five attribute scores: RS·rel, MLE, SWE, Math, SS. Standardized within each cluster's regression sample (different clusters have different applicant pools, so cross-cluster z-score comparisons are weaker than within-cluster comparisons across attributes).
Filters applied. Per-cluster pool: applied to ≥1 stream in cluster. Regression: + complete attributes. Logistic fitted only if ≥5 positives and ≥5 negatives in the regression sample.
Missing-data handling. Listwise drop on the 5 attribute features within each cluster.
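The filtering, missing-data, and standardization rules above can be sketched as one preparation step that runs before any model is fitted. The field names (`rs_rel`, `mle`, `swe`, `math`, `ss`, `ranked`), the function name, and the use of the population standard deviation are assumptions for illustration; the model fit itself is left to whatever library the analysis uses.

```python
from statistics import mean, pstdev

FEATURES = ["rs_rel", "mle", "swe", "math", "ss"]  # assumed field names
MIN_PER_CLASS = 5  # ">=5 positives and >=5 negatives" rule from the text


def prepare_cluster_sample(rows):
    """Listwise-drop incomplete rows, check the fit rule, z-score features.

    rows: dicts for applicants already restricted to one cluster's pool.
    """
    # Listwise deletion: keep rows with all five attribute scores present.
    complete = [r for r in rows if all(r.get(f) is not None for f in FEATURES)]
    y = [int(bool(r["ranked"])) for r in complete]
    positives = sum(y)
    negatives = len(y) - positives
    fit_ok = positives >= MIN_PER_CLASS and negatives >= MIN_PER_CLASS
    if not fit_ok:
        return {"X": None, "y": y, "fit_ok": False}

    # Standardize within this cluster's regression sample only.
    stats = {}
    for f in FEATURES:
        values = [r[f] for r in complete]
        stats[f] = (mean(values), pstdev(values) or 1.0)  # guard zero SD
    X = [[(r[f] - stats[f][0]) / stats[f][1] for f in FEATURES] for r in complete]
    return {"X": X, "y": y, "fit_ok": True}


# Tiny synthetic pool: 12 complete rows (6 ranked) plus 1 incomplete row.
demo_rows = [
    {"rs_rel": i, "mle": i % 3, "swe": i % 2, "math": 2 * i, "ss": 1 + i % 4,
     "ranked": i < 6}
    for i in range(12)
]
demo_rows.append({"rs_rel": None, "mle": 1, "swe": 1, "math": 1, "ss": 1,
                  "ranked": False})
demo = prepare_cluster_sample(demo_rows)
print(demo["fit_ok"], len(demo["y"]))
```

Under this rule a cluster with only 4 ranked applicants is never fitted, which matches the empty coefficient cells for clusters D and E in the results table.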
Key assumptions / caveats.