A1 — Does the composite score predict stream rankings?

Context

The composite score is the single number that the 10.0 process used to decide who advances from Stage 2 to Stage 3. It combines an applicant's Research Skills, Technical Execution, and Soft Skills attribute tiers into one value. We need to know whether the composite predicts what we actually care about: getting picked by a stream. This is the first sanity check: if the composite barely correlates with stream rankings, we'd need to rethink the rubric entirely.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
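The formula above can be sketched in a few lines of Python. This is an illustration of the stated weights only, not the pipeline's actual code: the function and argument names are hypothetical, the inputs are assumed to be numeric tier values, and the multiplier is assumed to scale RS before the 0.50 weight is applied.

```python
# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs: float, mle: float, swe: float, math: float,
                        ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with the multiplier on RS only."""
    # Assumption: the relevance multiplier scales RS before weighting.
    rs_adjusted = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Worked example with made-up tier values:
# RS=3 (Adjacent -> 2.55), TE = 1.5 + 0.6 + 0.8 = 2.9, SS=2
# composite = 1.275 + 1.015 + 0.30 = 2.59
example = empirical_composite(rs=3, mle=3, swe=2, math=4, ss=2, relevance="Adjacent")
```

Because the weights sum to 1 at both levels, an applicant at the top tier on every attribute with Direct relevance scores exactly the top of the scale.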

Outcome definitions

Outcome definitions used throughout these analyses:

Headline

In the whole empirical pool, the composite predicts ranking with AUC = 0.823 [0.789, 0.855]. Restricted to Stage 3 (range-restricted, so a more conservative test), AUC = 0.678 [0.631, 0.724]. The re-derived 9.0 baseline has AUC = 0.718 [0.669, 0.765], n = 951.

The whole-pool AUC is very high partly because the composite was used to gate Stage 3 admission — applicants with very low composite never reach a stream's review. The Stage-3 AUC is the more honest test of "given that you cleared the gate, does composite predict stream decisions?"
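The gate effect can be illustrated with a small simulation. Everything below is synthetic and hypothetical (the noise levels, the 36% gate fraction, and the ranking threshold are arbitrary choices, not estimates from the 10.0 data); it only shows why gating on a score inflates that score's whole-pool AUC relative to the gated pool.

```python
import random


def auc(scores, labels):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def simulate(n=2000, gate_frac=0.36, seed=0):
    """Return (whole-pool AUC, gated-pool AUC) on synthetic applicants."""
    rng = random.Random(seed)
    quality = [rng.gauss(0, 1) for _ in range(n)]        # latent, unobserved
    composite = [q + rng.gauss(0, 1) for q in quality]   # noisy rubric score
    stream_view = [q + rng.gauss(0, 1) for q in quality] # noisy stream judgment
    # Gate: only the top gate_frac by composite reach stream review.
    cutoff = sorted(composite, reverse=True)[int(n * gate_frac) - 1]
    passed = [c >= cutoff for c in composite]
    # Applicants below the gate can never be ranked.
    ranked = [int(p and s > 1.0) for p, s in zip(passed, stream_view)]
    whole = auc(composite, ranked)
    gated = auc([c for c, p in zip(composite, passed) if p],
                [r for r, p in zip(ranked, passed) if p])
    return whole, gated


whole_auc, gated_auc = simulate()
print(f"whole pool: {whole_auc:.3f}, gated pool: {gated_auc:.3f}")
```

In this setup every below-gate applicant is a guaranteed negative with a low score, so the whole-pool figure mixes "the gate worked as designed" with genuine predictive signal; the gated figure isolates the latter.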

AUC summary

| View | Outcome | n | n_pos | Base rate | AUC [95% CI] |
|---|---|---|---|---|---|
| Whole empirical pool | is_ranked | 1,683 | 147 | 0.087 | 0.823 [0.789, 0.855] |
| Whole empirical pool | is_invited | 1,683 | 404 | 0.240 | 0.805 [0.779, 0.829] |
| Stage-3 empirical (range-restricted) | is_ranked | 791 | 147 | 0.186 | 0.678 [0.631, 0.724] |
| Stage-3 empirical (range-restricted) | is_invited | 791 | 393 | 0.497 | 0.655 [0.617, 0.692] |
| Whole 10.0 sample (all tracks) | is_ranked | 2,203 | 189 | 0.086 | 0.712 [0.666, 0.757] |
| Whole 10.0 sample (all tracks) | is_invited | 2,203 | 519 | 0.236 | 0.700 [0.669, 0.729] |
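For readers who want to reproduce numbers of this shape, here is a minimal sketch of an AUC with a 95% interval. The report does not state how its intervals were computed; a percentile bootstrap is assumed here as one common choice, and the function names, `n_boot`, and `seed` are illustrative.

```python
import random


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U / (n_pos * n_neg)), average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(y for _, y in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("need at least one positive and one negative")
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # skip resamples that lost one of the classes
            stats.append(auc([scores[i] for i in idx], lab))
    stats.sort()
    lo = stats[int(round(alpha / 2 * n_boot))]
    hi = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return auc(scores, labels), lo, hi
```

Resampling whole rows keeps each applicant's score and outcome together, which is what lets the interval reflect sampling noise in both.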

The whole-pool inflation comes from the gate: rejected-at-Stage-2 applicants have zero chance of ranking. Calibration plots below show this directly.

Calibration

The probability of being ranked rises monotonically with composite score; the bottom 60% of the pool has essentially zero chance, while the top decile has roughly 60–70%.
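A calibration curve of this kind can be computed by sorting applicants by composite, cutting them into equal-count bins, and taking the observed outcome rate in each bin. The sketch below assumes deciles and uses hypothetical function names; the binning actually used for the plot is not stated in this report.

```python
def rate_by_quantile_bin(scores, outcomes, n_bins=10):
    """Observed outcome rate in each equal-count score bin, lowest bin first."""
    if len(scores) < n_bins:
        raise ValueError("need at least one row per bin")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        rates.append(sum(outcomes[i] for i in members) / len(members))
    return rates


def is_monotone_nondecreasing(values):
    """True if each bin's rate is at least the previous bin's rate."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy data: 20 applicants, only the 8 highest-scoring ones are ranked.
toy_scores = list(range(20))
toy_ranked = [0] * 12 + [1] * 8
toy_rates = rate_by_quantile_bin(toy_scores, toy_ranked)
```

On real data the monotonicity check is worth running explicitly, since small bins can produce local dips that are easy to miss in a plot.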

ROC by view

Composite distribution by outcome (Stage-3 pool)

Visible separation between ranked vs. not-ranked, but substantial overlap — Stage-3 AUC of 0.678 reflects this.

Composite vs. # streams ranking applicant

Spearman ρ = 0.250 (Stage-3 empirical, n = 791). Higher composite → ranked by more streams on average, but the relationship is noisy.
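As a reference for how this statistic handles the data shape here, the sketch below computes Spearman ρ as the Pearson correlation of average ranks. Average ranks matter because the count of streams ranking an applicant is heavily tied (many applicants share the same small count). The function names are illustrative; in practice a library routine would likely be used.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2.0  # mean of ranks i+1..j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy example: composite vs. number of streams ranking the applicant.
rho = spearman_rho([1.0, 2.0, 3.0, 4.0, 5.0], [0, 0, 1, 1, 3])
```

The function is undefined (division by zero) when either input is constant, which is the correct behavior for a correlation but worth guarding against in a pipeline.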

Stage-wise advancement by quintile

| Composite quintile | n | P(Stage 3) | P(invited) | P(ranked) | P(passed bar) |
|---|---|---|---|---|---|
| Q1 | 336 | 0.164 | 0.048 | 0.009 | 0.009 |
| Q2 | 337 | 0.297 | 0.136 | 0.027 | 0.027 |
| Q3 | 337 | 0.312 | 0.104 | 0.030 | 0.030 |
| Q4 | 335 | 0.579 | 0.257 | 0.084 | 0.084 |
| Q5 | 338 | 0.997 | 0.654 | 0.287 | 0.287 |

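The funnel is dominated by the Stage-3-gate effect: Q1–Q3 reach Stage 3 only 16–31% of the time and are ranked at most 3% of the time; advancement picks up in Q4 and concentrates in Q5, where nearly everyone reaches Stage 3 and 29% are ranked.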
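A funnel table of this shape can be built by splitting applicants into composite quintiles and averaging each stage indicator within a quintile. The sketch below is an assumption about the approach, not the report's code: the field names (`composite`, `reached_stage3`, `is_invited`, `is_ranked`) are hypothetical, and it uses a plain equal-count split, so group sizes may differ slightly from a tie-aware quantile cut.

```python
def funnel_by_quintile(rows, score_key="composite",
                       flags=("reached_stage3", "is_invited", "is_ranked")):
    """Per-quintile n and mean of each 0/1 flag, lowest quintile (Q1) first."""
    ordered = sorted(rows, key=lambda r: r[score_key])
    n = len(ordered)
    table = []
    for q in range(5):
        group = ordered[q * n // 5:(q + 1) * n // 5]
        entry = {"quintile": f"Q{q + 1}", "n": len(group)}
        for flag in flags:
            # Mean of a 0/1 indicator is the share of the group with flag set.
            entry[flag] = sum(r[flag] for r in group) / len(group)
        table.append(entry)
    return table


# Toy pool of 10 applicants with made-up outcomes.
toy_rows = [
    {"composite": i, "reached_stage3": int(i >= 6),
     "is_invited": int(i >= 8), "is_ranked": int(i == 9)}
    for i in range(10)
]
toy_table = funnel_by_quintile(toy_rows)
```

Each row reports unconditional rates within the quintile (for example, P(ranked) over everyone in Q5, not only those who reached Stage 3), which matches how the columns above are labeled.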

9.0 comparison (re-derived)

We re-fit a baseline analog on 9.0: the average of the z-scored centralized review components (research independence, publication record, technical execution, AI safety motivation), used to predict passed_mentors_bar. 9.0 has true offer data, so this outcome reflects actual offers.
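The baseline described above amounts to standardizing each review component and averaging across components per applicant. A minimal sketch follows; the dictionary keys are hypothetical labels for the four components, and missing values and zero-variance components are not handled.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def baseline_score(components):
    """Average of z-scored components.

    components maps a component name to a list of raw scores,
    one entry per applicant, all lists in the same applicant order.
    """
    z_columns = [zscore(col) for col in components.values()]
    return [mean(per_applicant) for per_applicant in zip(*z_columns)]


# Toy data for three applicants (made-up values).
toy = {
    "research_independence": [1, 2, 3],
    "publication_record": [0, 5, 10],
    "technical_execution": [2, 3, 4],
    "ai_safety_motivation": [1, 1, 4],
}
toy_scores = baseline_score(toy)
```

Z-scoring first puts components on a common scale so that one with a wider raw range does not dominate the average. Since AUC depends only on the ordering of scores, the choice between population and sample standard deviation does not change it when every component covers the same applicants.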

The two AUCs are not directly comparable: 9.0 baseline is across the full 9.0 applicant pool predicting offered, while 10.0 is predicting ranked. Also the 9.0 reviewers and the 10.0 LLM-graded rubric are different instruments. Treat the comparison as directional only.

Takeaways

  1. The composite has real predictive validity for stream rankings, even within the Stage-3 pool (which is the harder test). AUC ≈ 0.68 vs. 0.5 chance.
  2. Most of the apparent whole-pool predictive power is the Stage-2 gate doing the work — applicants who didn't clear the composite bar essentially never get ranked by anyone. This makes the Stage-3-only AUC the more honest test.
  3. The composite predicts not just whether someone is ranked, but by how many streams (Spearman ρ ≈ 0.25). High-composite applicants get ranked by multiple streams — suggesting the composite captures a "broadly desirable" axis rather than fitting one stream's idiosyncrasies.

A note on the 9.0 comparison

9.0 was the cohort before 10.0 (spring 2026). Its selection process was different — decentralized stream review with a partial centralized review on top. The 0.77 AUC figure from the 9.0 summary was for a more comprehensive multi-feature model, so the simple re-derived baseline shown here isn't directly comparable; it's a sanity check that the composite numbers we're seeing aren't wildly out of line with prior cohorts.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Three views for 10.0: (a) whole empirical pool: all applicants who selected the Empirical track at Stage 1 (n=1,683); (b) Stage-3 empirical: empirical-track applicants with a non-empty Stage 3 application list (n=791); (c) whole 10.0 sample, any track (n=2,203). All deduped to one row per person_id (kept the row with the furthest value of Furthest stage reached, ties broken by composite). Nanda is not excluded here because A1 is pool-level; the Nanda exclusion only applies to per-stream analyses.
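The dedup rule can be sketched as follows. The stage labels, field names, and the direction of the tie-break (higher composite wins) are assumptions made for illustration; the text above only says ties are broken by composite.

```python
# Hypothetical stage labels mapped to an ordering; the real field values may differ.
STAGE_ORDER = {"Stage 1": 1, "Stage 2": 2, "Stage 3": 3}


def dedupe_by_person(rows):
    """Keep one row per person_id: furthest stage first, then higher composite."""
    best = {}
    for row in rows:
        # Tuples compare element by element, so stage dominates and
        # composite only matters when stages are equal.
        key = (STAGE_ORDER[row["furthest_stage"]], row["composite"])
        pid = row["person_id"]
        if pid not in best or key > best[pid][0]:
            best[pid] = (key, row)
    return [row for _, row in best.values()]


toy_rows = [
    {"person_id": "a", "furthest_stage": "Stage 2", "composite": 2.0},
    {"person_id": "a", "furthest_stage": "Stage 3", "composite": 1.5},
    {"person_id": "b", "furthest_stage": "Stage 2", "composite": 1.0},
    {"person_id": "b", "furthest_stage": "Stage 2", "composite": 3.0},
]
deduped = dedupe_by_person(toy_rows)
```

Person "a" keeps the Stage 3 row despite its lower composite, and person "b" keeps the higher-composite row because both rows reached the same stage.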

Outcome variable(s). Primary: is_ranked (ranked by ≥1 stream). Secondary: is_invited_to_worktest — broader engagement pool, strict superset of is_ranked (covers any stream-side review status, Megastream takehome path, and ranking-without-worktest streams).

Predictor fields. [stage-2-empirical-review] Empirical Composite score — pre-computed 0.50·RS + 0.35·TE + 0.15·SS (per CLAUDE.md 10.0 scoring; relevance multiplier already applied to RS only; hard-floor dropped per project memory). Read directly; no recomputation. Values range 0–4.

Filters applied. Empirical-track filter applied for the empirical views via [stage-1-track] Selected tracks containing 'Empirical'. Stage-3 filter via non-empty Stage 3 streams actually applied to. No additional exclusions (special advances and topped-ups included — they are part of the real pool and A4 will look at them separately).

Missing-data handling. Listwise drop for AUC: rows with missing composite excluded (composite_all.notna() mask). Composite is non-null for 2203/2203 canonical rows.

Key assumptions / caveats.