B9 — Does applying to more streams improve an applicant's chances?

Context

Stage-3 applicants choose which specific streams to apply to. Some apply to 1–2; others apply to 20+. Does applying to more streams improve an applicant's chance of being ranked? And if so, is that effect driven by the strategy itself, or by the fact that strong applicants apply broadly (higher composite scorers naturally have more relevant streams)?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
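As a minimal sketch, the composite formula above can be written out as follows. The function and argument names are illustrative. The code assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied, and it takes a single relevance label as input because the text does not say how the label is chosen when an applicant applies to several streams.

```python
# Relevance multipliers quoted in the text; they apply to Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs: float, mle: float, swe: float, math_score: float,
                        ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier scales RS before weighting.
    """
    adjusted_rs = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss
```

For example, with RS = 3, MLE = SWE = Math = 2, and SS = 1, a Direct match gives 0.50·3 + 0.35·2 + 0.15·1 = 2.35, while an Adjacent match lowers the RS term to 0.50·2.55 and the composite to 2.125.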

Outcome definitions

Outcome definitions used throughout these analyses:

Headline

Yes, but mostly because strong applicants apply broadly. On its own, #streams is a moderate predictor of is_ranked (AUC = 0.632 [0.590, 0.671]), but it is confounded with composite (Spearman ρ between #streams and composite = +0.434).
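For readers who want to reproduce the two statistics quoted above, here is a dependency-free sketch of a rank-based AUC and a Spearman correlation. The function names are illustrative and this is not necessarily the code used for the analysis; in particular, the method behind the bracketed interval is not stated here, so it is not reproduced.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def auc(scores, labels):
    """P(a positive outscores a negative), with ties counted as 1/2."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Here `auc(n_streams, is_ranked)` would be called with one stream count and one 0/1 outcome per applicant, and `spearman(n_streams, composite)` with the matching composite scores.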

In a joint model (n_streams + composite), composite carries most of the weight (standardized coefficient: composite = +0.19, n_streams = +0.28). The marginal effect of applying to more streams, after controlling for quality, is small.
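A joint model of this kind can be sketched as a logistic regression on z-scored predictors, so that the coefficients are on a comparable scale. The example below uses only the standard library and fits by plain gradient ascent. The data are synthetic toy values in which the outcome depends on composite alone; they are not the cohort data, and the fitting routine is an assumption rather than the one used for the numbers above.

```python
import math
import random


def zscore(values):
    """Standardize to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]


def fit_logistic(columns, y, lr=1.0, steps=800):
    """Gradient ascent on the mean log-likelihood.

    Returns [intercept, coef_1, ..., coef_k] for the given predictor columns.
    """
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    beta = [0.0] * (len(columns) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, target in zip(rows, y):
            z = sum(b * v for b, v in zip(beta, row))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for j, v in enumerate(row):
                grad[j] += (target - p) * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Toy data: n_streams is correlated with composite, but only composite
# drives the outcome.
random.seed(0)
composite = [random.gauss(0, 1) for _ in range(400)]
n_streams = [c + random.gauss(0, 1) for c in composite]
is_ranked = [
    1 if random.random() < 1 / (1 + math.exp(-(1.5 * c - 1.0))) else 0
    for c in composite
]

intercept, b_streams, b_composite = fit_logistic(
    [zscore(n_streams), zscore(composite)], is_ranked
)
print(f"n_streams: {b_streams:+.2f}  composite: {b_composite:+.2f}")
```

In this toy setup the univariate association between `n_streams` and `is_ranked` is real, yet the joint fit assigns most of the weight to `composite`, which is the pattern a confounded predictor would show.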

P(ranked) by # streams

# streams   n     Mean # streams   P(ranked)   Mean composite
1–2         219   1.4              5.5%        1.29
3–5         246   3.9              18.7%       1.37
6–10        292   7.8              17.5%       2.07
11–15       142   12.3             26.8%       2.32
16–20       78    18.1             29.5%       2.67
20+         82    26.5             23.2%       2.51
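The bucketed summary above can be reproduced with a small grouping routine. This is a sketch under stated assumptions: the input tuple layout and function names are illustrative, and the last bucket is treated as "more than 20 streams" so that it does not overlap with 16–20.

```python
from collections import defaultdict

# Upper bounds of the buckets used in the table; anything above 20 is "20+".
BUCKETS = [(2, "1–2"), (5, "3–5"), (10, "6–10"), (15, "11–15"), (20, "16–20")]


def bucket_label(n_streams):
    """Map a stream count (>= 1) to its bucket label."""
    for upper, label in BUCKETS:
        if n_streams <= upper:
            return label
    return "20+"


def ranked_rate_by_bucket(applicants):
    """applicants: iterable of (n_streams, is_ranked, composite) tuples."""
    groups = defaultdict(list)
    for n_streams, is_ranked, composite in applicants:
        groups[bucket_label(n_streams)].append((n_streams, is_ranked, composite))
    summary = {}
    for label, rows in groups.items():
        n = len(rows)
        summary[label] = {
            "n": n,
            "mean_streams": sum(r[0] for r in rows) / n,
            "p_ranked": sum(r[1] for r in rows) / n,
            "mean_composite": sum(r[2] for r in rows) / n,
        }
    return summary
```

Each value in the returned dictionary corresponds to one row of the table: bucket size, mean stream count, share ranked, and mean composite.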

Mean composite by # streams

Composite rises with stream count up to the 16–20 bucket (1.29 → 2.67), then dips slightly at 20+ (2.51). Broadly, strong candidates apply to more streams. The 1–2 bucket is mostly weaker candidates who didn't see many streams as relevant; the 20+ bucket is mostly strong empirical-track applicants who could reasonably make a case for many streams.

Scatter: # streams vs composite

Green = ranked, gray = not. High-composite + high-#streams concentrate in the top-right.

Takeaways

  1. The strategy effect, properly controlled, is small. Strong candidates get ranked at high rates regardless of whether they applied to 5 or 25 streams. Weak candidates don't get ranked even if they apply to 25.
  2. #streams is moderately correlated with composite (Spearman ρ = +0.434), likely because high-quality applicants have more "relevant" streams to choose from and are more confident in applying broadly.
  3. Should we cap #streams for 11.0? The data doesn't strongly support a cap: the extra reviewer cost from low-quality applications is small, since most low-quality applicants don't apply to many streams anyway. The main downside of allowing high #streams is reviewer-side, because each stream sees more applications. If reviewer load is the binding constraint, a soft cap (e.g., "apply to up to 10") might help; this analysis can't tell us the optimal number.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Stage-3 applicants (n_s3_streams > 0): n=1,059. Joint logistic regression sample: n=1,059 (after dropping missing composite).

Outcome variable(s). is_ranked.

Predictor fields. n_s3_streams = len(Stage 3 streams actually applied to). composite_n = empirical composite (for the joint model).

Filters applied. Canonical dedup. Non-empty Stage-3 application list.

Missing-data handling. Listwise drop on composite for joint regression.
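The sample construction described above can be sketched as a single filtering pass. The record keys (`applicant_id`, `s3_streams`, `composite`) are hypothetical, and "canonical dedup" is assumed here to mean keeping the first record per applicant ID; the actual dedup rule is not specified in this section.

```python
def build_sample(records, require_composite=True):
    """Filter raw applicant records down to the analysis sample.

    records: dicts with hypothetical keys 'applicant_id', 's3_streams'
    (list of stream names), and 'composite' (float or None).
    """
    seen = set()
    sample = []
    for rec in records:
        # Assumed dedup rule: keep the first record per applicant ID.
        if rec["applicant_id"] in seen:
            continue
        seen.add(rec["applicant_id"])
        # Keep only applicants with a non-empty Stage-3 application list.
        if not rec["s3_streams"]:
            continue
        # Listwise drop on missing composite (needed for the joint regression).
        if require_composite and rec["composite"] is None:
            continue
        sample.append({**rec, "n_s3_streams": len(rec["s3_streams"])})
    return sample
```

Calling `build_sample(records, require_composite=False)` would give the sample for analyses that do not need composite, under the same assumptions.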

Key assumptions / caveats.