B6 — How does composite percentile predict outcomes in detail?

Context

A1 showed the composite predicts ranking. This analysis zooms in: within the Stage-3 empirical pool, how does outcome probability change with composite percentile? Specifically — is there a floor below which almost no one gets ranked? Is there a ceiling above which additional composite stops helping? Or is the relationship roughly linear?

Practical use: if the curve has a sharp floor, we can treat composite as a hard filter (anyone below X percentile is auto-rejected and doesn't waste reviewer time). If it's smoothly linear, composite is best treated as an informational signal that streams weigh alongside their other judgment.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
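
The formula above can be sketched in a few lines of Python. This is an illustration, not the scoring code used in 10.0: the class and function names are hypothetical, all sub-scores are assumed to share one numeric scale, and the relevance multiplier is assumed to scale Research Skills before the 0.50 weight is applied.

```python
from dataclasses import dataclass

# Relevance multipliers as listed above; applied to Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


@dataclass
class EmpiricalScores:
    """Hypothetical container for one applicant's rubric scores."""
    rs: float    # Research Skills
    mle: float   # Technical Execution sub-score: MLE
    swe: float   # Technical Execution sub-score: SWE
    math: float  # Technical Execution sub-score: Math
    ss: float    # Soft Skills
    relevance: str = "Direct"


def technical_execution(s: EmpiricalScores) -> float:
    # TE = 0.50*MLE + 0.30*SWE + 0.20*Math
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math


def composite(s: EmpiricalScores) -> float:
    # Assumption: the multiplier scales RS before the 0.50 weight is applied.
    rs_adjusted = RELEVANCE[s.relevance] * s.rs
    return 0.50 * rs_adjusted + 0.35 * technical_execution(s) + 0.15 * s.ss
```

Under these assumptions, an applicant with equal sub-scores and Direct relevance gets a composite equal to that shared score, because both sets of weights sum to 1. An Adjacent or Distant rating lowers only the Research Skills half of the composite.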

Outcome definitions

Outcome definitions used throughout these analyses:

Headline

The headline claim is a concave relationship with a strong floor, but the decile table supports it only in part. P(ranked) already exceeds 5% in D1 (19.0%), so the 5% threshold is crossed at the very bottom of the Stage-3 pool. Within the pool the rates are non-monotonic: D4 (1.6%) and D2 (6.8%) are the only deciles near zero, while D1 and D3 convert at 19.0% and 30.4%. The biggest single-decile jump in P(ranked) is D4→D5 (+26.7 percentage points). At the top the curve flattens: D8→D9 adds +7.5 pp and D9→D10 adds another +2.8 pp.

Percentile curve

Decile breakdown

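| Decile | n | Mean percentile | P(invited) | P(ranked) | P(offered) |
|--------|----|-----------------|------------|-----------|------------|
| D1 | 58 | 5.03 | 50.0% | 19.0% | 19.0% |
| D2 | 59 | 15.16 | 39.0% | 6.8% | 6.8% |
| D3 | 56 | 25.17 | 51.8% | 30.4% | 30.4% |
| D4 | 62 | 35.53 | 33.9% | 1.6% | 1.6% |
| D5 | 53 | 45.49 | 62.3% | 28.3% | 28.3% |
| D6 | 61 | 55.37 | 52.5% | 16.4% | 16.4% |
| D7 | 53 | 65.25 | 62.3% | 26.4% | 26.4% |
| D8 | 58 | 74.87 | 67.2% | 27.6% | 27.6% |
| D9 | 57 | 84.95 | 73.7% | 35.1% | 35.1% |
| D10 | 58 | 94.97 | 70.7% | 37.9% | 37.9% |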
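
The decile-to-decile changes in P(ranked) can be recomputed directly from the table. The sketch below hard-codes the P(ranked) column as a list (the variable and function names are illustrative, not part of the analysis code) and reports each adjacent change in percentage points, which makes it easy to check the largest jump and to see which adjacent pairs move down rather than up.

```python
# P(ranked) by decile D1..D10, in percent, copied from the decile table.
P_RANKED = [19.0, 6.8, 30.4, 1.6, 28.3, 16.4, 26.4, 27.6, 35.1, 37.9]


def adjacent_deltas(values):
    """Return (label, change in percentage points) for each adjacent decile pair."""
    return [
        (f"D{i + 1}->D{i + 2}", round(values[i + 1] - values[i], 1))
        for i in range(len(values) - 1)
    ]


deltas = adjacent_deltas(P_RANKED)
largest = max(deltas, key=lambda pair: pair[1])           # biggest upward jump
n_drops = sum(1 for _, change in deltas if change < 0)    # pairs that move down

for label, change in deltas:
    print(f"{label}: {change:+.1f} pp")
print("largest jump:", largest[0])
print("adjacent pairs that decrease:", n_drops)
```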

Takeaways

  1. The composite floor is weaker than a hard cutoff. D4 (1.6%) and D2 (6.8%) convert at near-zero rates, but D1 (19.0%) and D3 (30.4%) do not, so the bottom of the already-selected Stage-3 pool is not uniformly low-yield.
  2. Diminishing returns kick in around the upper deciles. Moving from D8 to D9 adds +7.5 pp in P(ranked) and moving from D9 to D10 adds +2.8 pp, so the curve flattens at the top.
  3. For 11.0 design: the intended use is composite as a gating score (hard floor at a defined percentile) rather than just an informational signal, so that streams do not spend time on Stage-3 applicants with near-zero chance. The decile table supports near-zero conversion for D4 and D2 only, so the floor percentile needs to be set with care.
  4. Calibration of the floor: a floor around the 30th–40th percentile of the Stage-3 pool would remove a meaningful chunk of review (173–235 of the 575 applicants), but it is not free. The n and P(ranked) columns imply that roughly 32–33 applicants below that line were ranked in 10.0.
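
The cost of a candidate floor can be estimated from the decile table alone. The sketch below is a rough check rather than the analysis code: it backs out approximate ranked counts as n × P(ranked), rounded to whole applicants, and it assumes a floor at the 30th or 40th percentile corresponds to cutting the bottom 3 or 4 deciles. The function and variable names are hypothetical.

```python
# (n, P(ranked) in percent) for D1..D10, copied from the decile table.
DECILES = [
    (58, 19.0), (59, 6.8), (56, 30.4), (62, 1.6), (53, 28.3),
    (61, 16.4), (53, 26.4), (58, 27.6), (57, 35.1), (58, 37.9),
]


def floor_impact(deciles, n_cut):
    """Estimate what a floor removing the bottom `n_cut` deciles would cut."""
    # Approximate ranked counts per decile; rounding is an assumption.
    ranked = [round(n * p / 100) for n, p in deciles]
    total_n = sum(n for n, _ in deciles)
    cut_n = sum(n for n, _ in deciles[:n_cut])
    cut_ranked = sum(ranked[:n_cut])
    return {
        "applicants_cut": cut_n,
        "ranked_cut": cut_ranked,
        "share_of_pool_cut": cut_n / total_n,
        "share_of_ranked_cut": cut_ranked / sum(ranked),
    }


for n_cut in (3, 4):
    print(f"floor at {n_cut * 10}th percentile:", floor_impact(DECILES, n_cut))
```

Comparing `share_of_pool_cut` with `share_of_ranked_cut` gives a quick read on whether a floor saves more review effort than it costs in ranked applicants.
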

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Stage-3 empirical pool with a non-null percentile field (n=575). Percentile uses the pre-computed [stage-3-empirical] Empirical composite score percentile — rank within the Stage-3 empirical pool.

Outcome variable(s). is_ranked, is_invited_to_worktest, passed_mentors_bar.

Predictor fields. [stage-3-empirical] Empirical composite score percentile, continuous 0–1. The Mean percentile column in the decile table shows the same quantity on a 0–100 scale.

Filters applied. Stage-3 empirical filter (same as A1). Canonical dedup.

Missing-data handling. Listwise drop on percentile.
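
The sample and missing-data steps above can be sketched as follows. This is a minimal illustration, not the pipeline's code: rows are plain dicts, `percentile` is a hypothetical short name for the pre-computed percentile field, and deciles are assumed to be equal-width bins of the 0–1 percentile (which is consistent with the mean percentiles in the decile table, but is not stated in the source).

```python
def to_decile(percentile):
    """Map a 0-1 percentile to a decile number 1..10 (1.0 falls in D10)."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError(f"percentile out of range: {percentile}")
    return min(int(percentile * 10), 9) + 1


def decile_rates(rows, outcome="is_ranked"):
    """Listwise-drop rows with no percentile, then compute n and rate per decile."""
    kept = [r for r in rows if r.get("percentile") is not None]  # listwise drop
    counts = {}
    for r in kept:
        d = to_decile(r["percentile"])
        n, hits = counts.get(d, (0, 0))
        counts[d] = (n + 1, hits + int(bool(r[outcome])))
    return {d: {"n": n, "rate": hits / n} for d, (n, hits) in sorted(counts.items())}


# Toy rows for illustration only; these are not real applicants.
example = [
    {"percentile": 0.05, "is_ranked": True},
    {"percentile": 0.08, "is_ranked": False},
    {"percentile": None, "is_ranked": True},   # dropped: no percentile
    {"percentile": 0.95, "is_ranked": True},
]
print(decile_rates(example))
```

Passing a different `outcome` key (for example `is_invited_to_worktest`) reuses the same binning for the other outcome variables.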

Key assumptions / caveats.