D1 — Convergent validity

Context

Many of our individual selection signals are weak in isolation: CodeSignal, ToC alignment, AIS engagement count, and the research-taste test each carry only modest predictive power for ranking (Parts A and B). But they may be picking up different aspects of applicant quality. If we count how many signals an applicant is above the median on, does that count predict ranking better than any single signal alone?

Practical question: if a stream is on the fence about a borderline candidate, would knowing 'they're above median on 4 of the 5 signals' be useful information beyond the composite score alone?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
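The composite formula above can be sketched in a few lines of Python. This is an illustration, not the pipeline's actual code: the function names and the 0–10 score scale in the example are hypothetical, and it assumes the relevance multiplier scales Research Skills before the weighted sum is taken.

```python
# Relevance multipliers as stated in the text; applied to Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def composite(rs: float, mle: float, swe: float, math: float, ss: float,
              relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS for the empirical track."""
    # ASSUMPTION: the multiplier scales RS before weighting.
    rs_adjusted = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant on an assumed 0-10 scale.
example = composite(rs=8.0, mle=7.0, swe=6.0, math=5.0, ss=9.0,
                    relevance="Adjacent")
```

Because both sets of weights sum to 1, an applicant with identical sub-scores and a Direct match gets that same value as their composite, which is a quick sanity check on any implementation.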

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
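The relationship between the first two outcomes can be made concrete with a small sketch. The `StreamEngagement` record and its field names are hypothetical stand-ins for however the engagement data is actually stored; each flag is assumed to mean "by at least one stream".

```python
from dataclasses import dataclass


@dataclass
class StreamEngagement:
    # Hypothetical per-applicant flags, each true if >=1 stream did this.
    worktest_invited: bool = False
    interview_invited: bool = False
    ranked: bool = False
    megastream_takehome_sent: bool = False


def is_ranked(e: StreamEngagement) -> bool:
    """Primary outcome: ranked by at least one stream."""
    return e.ranked


def is_invited_to_worktest(e: StreamEngagement) -> bool:
    """Secondary outcome: any form of engagement, ranking included."""
    return (e.worktest_invited or e.interview_invited
            or e.ranked or e.megastream_takehome_sent)
```

Written this way, every ranked applicant is also counted as engaged, which is the superset property the definitions describe.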

Outcome rate by # signals above median

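The pattern is monotone: P(ranked) rises with every additional above-median signal, from about 9% for applicants with 0–1 above-median signals to 36.5% at 4 and 66.7% at 5, against an overall rate of 18.6% (147 of 791). The top bucket is small (n = 24), so its rate is imprecise.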
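A minimal sketch of how an agreement count and the per-count ranking rate might be computed follows. It makes several assumptions that the text does not confirm: that the five signals are the predictors named in the AUC table, that "above median" is a strict comparison against the median of observed values, and that a missing signal counts as not above median.

```python
from collections import defaultdict
from statistics import median

# ASSUMPTION: these five predictors are the signals being counted.
SIGNALS = ["composite", "codesignal", "toc", "rt", "ais_count"]


def agreement_counts(applicants):
    """Count, per applicant, how many signals are strictly above the median.

    `applicants` is a list of dicts mapping signal name -> number or None.
    """
    medians = {}
    for s in SIGNALS:
        observed = [a[s] for a in applicants if a.get(s) is not None]
        if observed:
            medians[s] = median(observed)
    counts = []
    for a in applicants:
        # ASSUMPTION: a missing value never counts as above median.
        counts.append(sum(
            1 for s in medians
            if a.get(s) is not None and a[s] > medians[s]
        ))
    return counts


def rate_by_count(counts, ranked):
    """Return {count: (n, n_ranked, rate)} for each observed count."""
    tally = defaultdict(lambda: [0, 0])
    for c, r in zip(counts, ranked):
        tally[c][0] += 1
        tally[c][1] += int(r)
    return {c: (n, k, k / n) for c, (n, k) in sorted(tally.items())}
```

How missing values are handled matters here, because CodeSignal and the research-taste test are available for only 676 and 393 of the 791 applicants; a different convention (for example, counting only over available signals) would shift applicants between buckets.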

AUC: each signal alone vs agreement count

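The agreement count's AUC (0.680, 95% CI [0.633, 0.729]) is essentially identical to the composite's (0.678, [0.631, 0.724]) and higher than that of each individual weak signal (0.572 to 0.634). Reducing every signal to an above/below-median flag and counting therefore loses little ranking information relative to the composite, but the count does not clearly outperform it. Note that the CodeSignal and research-taste AUCs are computed on smaller samples (n = 676 and n = 393), so the rows are not strictly like-for-like.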
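For readers who want to reproduce numbers of this kind, the sketch below computes AUC from the Mann-Whitney rank statistic and a percentile-bootstrap confidence interval. The text does not say how the reported intervals were obtained, so the bootstrap here is an assumption, as are the function names and defaults.

```python
import random


def auc(scores, labels):
    """AUC as P(positive scores higher than negative), ties counted as 1/2."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC (ASSUMED method, not from the text)."""
    auc(scores, labels)  # fail fast if only one class is present
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(1 for y in ys if y) < n:  # skip one-class resamples
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The rank-based form handles ties through midranks, which matters for the agreement count because it takes only six distinct values.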

When signals disagree

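Applicants whose composite is below the median but who are above the median on at least two other signals are ranked at 18.2% (n = 170), about the same rate as applicants whose composite is above the median with at most one other signal above (16.5%, n = 188). This is modest evidence that the secondary signals add information at the margin: agreement among them roughly offsets a below-median composite. The two rates are statistically indistinguishable at these sample sizes.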
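To gauge how much weight the gap between the two disagreement groups can bear, a simple Wald interval for the difference in proportions is sketched below. The ranked counts are not reported directly; 31 in each group is back-computed from the stated rates and group sizes (31/188 = 16.5%, 31/170 = 18.2%), and the helper name is hypothetical.

```python
from math import sqrt


def diff_in_proportions(k1, n1, k2, n2, z=1.96):
    """Return (p2 - p1, (lower, upper)) using a Wald 95% interval."""
    p1, p2 = k1 / n1, k2 / n2
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)


# Group 1: composite above median, <=1 other signal above (n = 188).
# Group 2: composite below median, >=2 other signals above (n = 170).
# ASSUMPTION: 31 ranked in each group, back-computed from the reported rates.
diff, (lower, upper) = diff_in_proportions(31, 188, 31, 170)
```

Under these assumptions the difference is about 1.7 percentage points with an interval of roughly −6 to +10 points, so the data support "about the same" rather than "higher".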

# signals above median	n	n ranked	P(ranked)
0	91	8	8.8%
1	195	18	9.2%
2	246	42	17.1%
3	161	36	22.4%
4	74	27	36.5%
5	24	16	66.7%

Predictor	n	AUC	95% CI
composite	791	0.678	[0.631, 0.724]
codesignal	676	0.606	[0.550, 0.658]
toc	791	0.612	[0.561, 0.668]
rt	393	0.634	[0.576, 0.692]
ais_count	791	0.572	[0.521, 0.622]
agreement_count	791	0.680	[0.633, 0.729]

Group	n	P(ranked)
Composite above median, ≤1 other signal above	188	16.5%
Composite below median, ≥2 other signals above	170	18.2%
