A2 — Incremental validity

Context

The composite score combines five empirical attribute sub-scores. But Stage 2 also collected other signals about each applicant — a CodeSignal coding test, a research-taste test (taken by ~400 Stage-3 empirical applicants), an AI safety engagement score (duration + multi-select), and a Theory-of-Change (ToC) ranking alignment score that measured how the applicant prioritized AI risks. This analysis asks: do any of these add predictive value beyond what the composite already captures, or are they redundant?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
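As a minimal sketch, the composite formula above can be written out directly. The weights and multipliers come from the text; the class and function names, the sub-score scale, and the choice to apply the relevance multiplier to Research Skills before weighting are assumptions made for illustration.

```python
from dataclasses import dataclass

# Relevance multipliers from the text; applied to Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


@dataclass
class SubScores:
    """Hypothetical container for the five empirical sub-scores."""
    rs: float    # Research Skills
    mle: float   # Technical Execution: ML engineering
    swe: float   # Technical Execution: software engineering
    math: float  # Technical Execution: math
    ss: float    # Soft Skills


def technical_execution(s: SubScores) -> float:
    # TE = 0.50*MLE + 0.30*SWE + 0.20*Math
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math


def composite(s: SubScores, relevance: str) -> float:
    # Assumption: the multiplier scales RS before the 0.50 weight is applied.
    rs_adjusted = RELEVANCE[relevance] * s.rs
    return 0.50 * rs_adjusted + 0.35 * technical_execution(s) + 0.15 * s.ss
```

With every sub-score set to 8 on a hypothetical common scale, a Direct applicant scores 8.0 and a Distant applicant scores 6.4, so the multiplier alone can move the composite by up to 20% of the Research Skills value.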

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
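The three outcome flags can be sketched as simple predicates over per-applicant engagement records. The record layout and field names below are hypothetical; only the logical relationships (the secondary outcome is a superset of the primary one, and passed_mentors_bar means offered or waitlisted) come from the definitions above.

```python
from dataclasses import dataclass


@dataclass
class StreamEngagement:
    """Hypothetical per-applicant flags; each is True if it happened in >= 1 stream."""
    worktest_invite: bool = False
    interview_invite: bool = False
    ranked: bool = False
    megastream_takehome: bool = False
    offered: bool = False
    waitlisted: bool = False


def is_ranked(e: StreamEngagement) -> bool:
    # Primary outcome: ranked by at least one stream.
    return e.ranked


def is_invited_to_worktest(e: StreamEngagement) -> bool:
    # Secondary outcome: any engagement at all, so it includes every ranked applicant.
    return (e.worktest_invite or e.interview_invite
            or e.ranked or e.megastream_takehome)


def passed_mentors_bar(e: StreamEngagement) -> bool:
    # Offered or waitlisted; in 10.0 this coincided with is_ranked.
    return e.offered or e.waitlisted
```

Because `ranked` is one of the disjuncts in `is_invited_to_worktest`, the superset relationship holds by construction rather than needing a separate check.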

Definitions used in this analysis

CodeSignal score — applicant's score on the industry-standard CodeSignal Industry Coding Assessment. 10.0 used a custom variant ("MATS Chatbot Service" test).
Research-taste test — a Stage-3 work test in which applicants evaluate a research scenario; produces a final score plus a tier label. Only some streams opted in, so ~400 of ~600 Stage-3 empirical applicants took it.
ToC alignment score — a 0–100 score capturing how well the applicant's ranking of AI safety threat models aligned with a reference Theory of Change.
AIS engagement — multi-select fields asking what AI-safety courses/programs/orgs the applicant has been involved with, plus a duration field for how long they've been engaging with AI safety.
The 8.0 paradox — in cohort 8.0 (autumn 2025), CodeSignal score predicted who got accepted (AUC ≈ 0.77) but did NOT correlate with mentor evaluations of in-program performance (r ≈ 0). This raises the question of whether CodeSignal is selecting on something real or on a proxy that doesn't matter for actual research output.

Headline

Top incremental signal: Research taste Part 1 (Δ AUC +0.051 [-0.002, +0.104]).

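Composite-only AUC on the Stage-3 empirical pool was 0.662 [0.587, 0.727]. The most informative additions are listed below. Every 95% CI includes zero, so none of the gains is statistically distinguishable from no improvement. Small nominal Δ values are expected in any case, because the composite already aggregates most of the signal. The base AUC differs from row to row because each comparison appears to use only the applicants for whom that signal is available (see the n column).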

Incremental Δ AUC (composite + signal vs. composite alone)

Added signal	n	AUC base	AUC full	Δ AUC	95% CI
CodeSignal score	676	0.677	0.688	+0.011	[-0.020, +0.039]
Research taste final	393	0.624	0.669	+0.045	[-0.005, +0.095]
Research taste Part 1	394	0.625	0.676	+0.051	[-0.002, +0.104]
Research taste Part 2	393	0.624	0.655	+0.031	[-0.010, +0.071]
ToC alignment	791	0.678	0.705	+0.028	[-0.006, +0.062]
AIS duration	470	0.662	0.676	+0.014	[-0.014, +0.043]
AIS engagement count	791	0.678	0.676	-0.002	[-0.013, +0.008]
AIS bundle (duration + count)	470	0.662	0.676	+0.013	[-0.013, +0.043]

Attribute tiers vs. composite

If we drop the composite entirely and use the raw attribute tiers (RS·relevance, MLE, SWE, Math, SS) instead, AUC = 0.709 [0.637, 0.774], vs. composite-only 0.662 [0.587, 0.727] on the same subsample (n = 480). Δ = +0.046 [-0.012, +0.109].

The CI brackets zero, so the data cannot distinguish the current composite from a linear combination of the raw attributes with freely fitted weights, although the point estimate (+0.046) favors the refit. A3 will examine specific weight choices.
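A Δ AUC with a 95% interval, as reported in this analysis, can be obtained with a paired bootstrap: resample applicants with replacement, recompute both AUCs on the same resample, and take percentiles of the differences. The sketch below is one way to do this under stated assumptions, not necessarily the procedure used here. It takes the two models' scores as given (how the "full" model is fitted is not specified in this document), and the function names, 2,000 resamples, and percentile interval are illustrative choices.

```python
import random


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_delta_auc(labels, base, full, n_boot=2000, seed=0):
    """Return (point estimate, CI low, CI high) for AUC(full) - AUC(base)."""
    rng = random.Random(seed)
    n = len(labels)
    point = auc(labels, full) - auc(labels, base)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that contain only one class
            b = [base[i] for i in idx]
            f = [full[i] for i in idx]
            deltas.append(auc(y, f) - auc(y, b))
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[max(int(0.975 * n_boot) - 1, 0)]
    return point, lo, hi
```

Resampling the same applicants for both models keeps the comparison paired, which matters because the two AUCs are strongly correlated and an unpaired interval would be wider than necessary.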

CodeSignal paradox marker

Univariate AUC for CodeSignal score → is_ranked (Stage-3 empirical, n = 676): 0.606 [0.552, 0.661].

An AUC above 0.5 means CodeSignal selects (predicts ranking), which is the original 8.0 finding. Here the CI [0.552, 0.661] excludes 0.5, so the selection side holds in 10.0. C1 closes the loop by testing whether the same predictor also tracks mentor-eval scores (the performance side). The 8.0 paradox would show up as a positive selection AUC combined with a null performance correlation.
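The two halves of the paradox are two different statistics, computed on two different populations. The sketch below computes both: a selection AUC over all applicants, and a Pearson correlation over admitted fellows only. The function and argument names are hypothetical, and the mentor-eval data belongs to C1, not to this analysis.

```python
def selection_auc(ranked, scores):
    """AUC as a probability: chance that a randomly chosen ranked applicant
    outscores a randomly chosen unranked one (ties count as half)."""
    pos = [s for s, y in zip(scores, ranked) if y]
    neg = [s for s, y in zip(scores, ranked) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def paradox_stats(ranked, applicant_scores, fellow_scores, mentor_evals):
    """Return (selection AUC over applicants, performance r over fellows)."""
    return (selection_auc(ranked, applicant_scores),
            pearson_r(fellow_scores, mentor_evals))
```

Note that the performance correlation is computed only on people who were selected, so range restriction on the predictor can shrink r toward zero even when the predictor carries real signal.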
