D1 — Does signal-agreement predict ranking better than any single signal?

Context

Many of our individual selection signals are weak in isolation: CodeSignal, ToC alignment, AIS engagement count, and the research-taste test each carry modest predictive power for ranking (Parts A and B). But they may be picking up different aspects of applicant quality. If we count how many signals an applicant is above-median on (these four plus the composite score, five in total), does that count predict ranking better than any single signal alone?

Practical question: if a stream is on the fence about a borderline candidate, would knowing 'they're above median on 4 of 5 weak signals' be useful information beyond composite score alone?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
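
The formula above can be sketched in a few lines. This is a minimal illustration, not the production scoring code: the function names, the 0-10 score scale, and the choice to apply the relevance multiplier to RS before the 0.50 weight are all assumptions; only the weights and multiplier values come from the text.

```python
# Weights and multipliers are taken from the formula above; names are illustrative.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def composite(rs: float, mle: float, swe: float, math_score: float,
              ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with RS scaled by relevance."""
    # Assumption: the multiplier scales RS before the 0.50 weight is applied.
    adjusted_rs = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant with 8 on every sub-score (the scale is assumed).
direct = composite(8, 8, 8, 8, 8, "Direct")
distant = composite(8, 8, 8, 8, 8, "Distant")
print(f"Direct: {direct:.2f}  Distant: {distant:.2f}")
```

Under these assumptions, the same sub-scores yield 8.0 with a Direct match and 6.4 with a Distant match, which shows how strongly the relevance multiplier can move the composite.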

Outcome definitions

Outcome definitions used throughout these analyses:

Outcome rate by # signals above median

# signals above median     n    n ranked    P(ranked)
0                         91           8         8.8%
1                        195          18         9.2%
2                        246          42        17.1%
3                        161          36        22.4%
4                         74          27        36.5%
5                         24          16        66.7%

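The pattern is monotone: P(ranked) rises with every additional above-median signal. Applicants with 0–1 above-median signals are ranked about 9% of the time, roughly half the pool-wide rate of 18.6% (147 of 791); applicants with 4 and 5 above-median signals are ranked at 36.5% and 66.7%. The top bucket holds only 24 applicants, so its rate is the least precise.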
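
To gauge how much weight each bucket can bear, the rates can be recomputed from the counts with an interval attached. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the analysis itself reports; the counts are copied from the table above.

```python
import math

# (signals above median, n, n ranked), copied from the table above.
ROWS = [(0, 91, 8), (1, 195, 18), (2, 246, 42),
        (3, 161, 36), (4, 74, 27), (5, 24, 16)]


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


intervals = {}
for count, n, ranked in ROWS:
    lo, hi = wilson_interval(ranked, n)
    intervals[count] = (lo, hi)
    print(f"{count}  n={n:3d}  P(ranked)={ranked / n:5.1%}  "
          f"95% CI [{lo:.1%}, {hi:.1%}]")
```

For the 5-signal bucket (16 of 24) the interval runs from roughly 47% to 82%, so the size of the jump at the top is much less certain than the table's point estimate suggests.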

AUC: each signal alone vs agreement count

Predictor            n      AUC    95% CI
composite          791    0.678    [0.631, 0.724]
codesignal         676    0.606    [0.550, 0.658]
toc                791    0.612    [0.561, 0.668]
rt                 393    0.634    [0.576, 0.692]
ais_count          791    0.572    [0.521, 0.622]
agreement_count    791    0.680    [0.633, 0.729]

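The agreement count's AUC (0.680) is essentially identical to the composite's (0.678), and the two confidence intervals overlap almost completely. So the count does not predict ranking better than the composite alone, although its point estimate is higher than that of each of the four weaker signals. Note that the count includes the composite's own above-median flag, so the two are not independent predictors. What the result does show is that reducing each signal to a binary flag and summing the flags loses little relative to the continuous composite.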
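
The agreement-count AUC can be rebuilt from the outcome-rate table alone, since that table gives the number of ranked and unranked applicants at each count. The sketch below does this and attaches a percentile bootstrap interval. The bootstrap is an assumption: the analysis does not say how its confidence intervals were computed, so the interval printed here need not match the one in the table.

```python
import random
from collections import Counter

# (signals above median, n, n ranked), copied from the outcome-rate table.
ROWS = [(0, 91, 8), (1, 195, 18), (2, 246, 42),
        (3, 161, 36), (4, 74, 27), (5, 24, 16)]


def auc(scores, labels):
    """P(a ranked applicant outscores an unranked one); ties count as 0.5."""
    pos = Counter(s for s, y in zip(scores, labels) if y == 1)
    neg = Counter(s for s, y in zip(scores, labels) if y == 0)
    wins, neg_below = 0.0, 0
    for value in sorted(set(pos) | set(neg)):
        wins += pos[value] * (neg_below + 0.5 * neg[value])
        neg_below += neg[value]
    return wins / (sum(pos.values()) * sum(neg.values()))


def bootstrap_ci(scores, labels, n_boot=500, seed=0):
    """Percentile bootstrap CI for the AUC (resampling applicants)."""
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # need both classes to define an AUC
            stats.append(auc([scores[i] for i in sample], ys))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Expand the table into one (score, label) pair per applicant.
scores, labels = [], []
for count, n, ranked in ROWS:
    scores += [count] * n
    labels += [1] * ranked + [0] * (n - ranked)

point = auc(scores, labels)
lo, hi = bootstrap_ci(scores, labels)
print(f"agreement_count AUC = {point:.3f}, bootstrap 95% CI [{lo:.3f}, {hi:.3f}]")
```

The point estimate comes out at 0.680, matching the agreement_count row. The same `auc` function works for continuous predictors such as the composite, but those values are not available from the tables here.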

When signals disagree

Group                                                n    P(ranked)
Composite above median, ≤1 other signal above      188        16.5%
Composite below median, ≥2 other signals above     170        18.2%

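The two groups are ranked at nearly the same rate (16.5% vs 18.2%), both close to the pool-wide rate of about 18.6%. An above-median composite with little support from the other signals does no better than a below-median composite backed by two or more of them, which is modest evidence that the secondary signals add information at the margin.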
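
Whether the two disagreement groups really differ can be checked with a two-proportion z-test. This is a rough sketch: the test is a choice made here, and the ranked counts (31 in each group) are back-calculated from the rounded percentages in the table (16.5% of 188, 18.2% of 170), so they are an assumption, not reported figures.

```python
import math


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Assumed counts, inferred from the rounded percentages in the table above.
z, p_value = two_proportion_z(k1=31, n1=188, k2=31, n2=170)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With these counts z is about 0.44, far from conventional significance thresholds, so the data do not distinguish the two groups' ranking rates.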

Takeaways

  1. Agreement-count predicts ranking with AUC similar to composite alone. Convergent validity is real: applicants high on multiple weak signals are disproportionately ranked.
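  2. The relationship is monotone but not evenly spaced. P(ranked) is flat from 0 to 1 signals (8.8% to 9.2%), climbs through 2, 3 and 4 (17.1%, 22.4%, 36.5%), then jumps to 66.7% at 5. That last step rests on only 24 applicants, so treat it as suggestive of a threshold, not as an established one.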
  3. For 11.0: surface a simple "how many weak signals does this applicant pass?" indicator to Stage-3 reviewers alongside the composite. It's a cheap secondary view that captures the convergent-validity intuition.
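  4. Disagreement cases look alike: composite above median with at most one other signal above → 16.5% ranked; composite below median with two or more other signals above → 18.2% ranked. Neither group beats the pool-wide rate, so an above-median composite without support from the other signals is not, on its own, a strong indication of being ranked.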

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Stage-3 empirical pool (n=791).

Outcome variable(s). is_ranked.

Predictor fields. Five binary above-median flags: composite, CodeSignal, ToC alignment, research-taste final, AIS engagement count. Each thresholded at the Stage-3-empirical median. Agreement count = sum of flags (0–5).

Filters applied. Stage-3 empirical filter. Canonical dedup.

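Missing-data handling. Each flag's threshold is the median of the non-null values for that signal. Missing values do NOT contribute: an applicant with a missing signal gets no credit for that flag, so applicants without a CodeSignal score (n=676 have one) or a research-taste score (n=393 have one) have a lower maximum agreement count.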
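
The flag construction and missing-data rule described above can be sketched as follows. The field names follow the AUC table; the data layout (one dict per applicant, `None` for missing), the strict `>` comparison for "above median", and the four toy applicants are all assumptions for illustration.

```python
from statistics import median

SIGNALS = ["composite", "codesignal", "toc", "rt", "ais_count"]


def agreement_counts(applicants: list[dict]) -> list[int]:
    """Count, per applicant, how many signals are strictly above the median."""
    # Thresholds use only non-null values, per the missing-data rule.
    thresholds = {
        s: median(a[s] for a in applicants if a.get(s) is not None)
        for s in SIGNALS
    }
    # A missing signal never contributes to the count.
    return [
        sum(1 for s in SIGNALS if a.get(s) is not None and a[s] > thresholds[s])
        for a in applicants
    ]


# Hypothetical applicants; values and scales are made up.
applicants = [
    {"composite": 7.0, "codesignal": 500, "toc": 3, "rt": 6.0, "ais_count": 2},
    {"composite": 5.0, "codesignal": None, "toc": 4, "rt": None, "ais_count": 0},
    {"composite": 8.0, "codesignal": 700, "toc": 5, "rt": 9.0, "ais_count": 4},
    {"composite": 4.0, "codesignal": 450, "toc": 2, "rt": None, "ais_count": 1},
]
counts = agreement_counts(applicants)
print(counts)  # [2, 1, 5, 0]
```

One consequence of this rule is that an applicant with missing signals is capped below 5, so a low count can reflect missing data as well as weak signals.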

Key assumptions / caveats.