The composite score combines five empirical attribute sub-scores. But Stage 2 also collected other signals about each applicant — a CodeSignal coding test, a research-taste test (taken by ~400 Stage-3 empirical applicants), an AI safety engagement score (duration + multi-select), and a Theory-of-Change (ToC) ranking alignment score that measured how the applicant prioritized AI risks. This analysis asks: do any of these add predictive value beyond what the composite already captures, or are they redundant?
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
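The formula above can be sketched as a small function. This is a minimal illustration, not the pipeline's actual code: the function and argument names are hypothetical, the sub-scores are assumed to share one numeric scale, and the relevance multiplier is assumed to scale RS before the 0.50 weight is applied.

```python
# Weights and multipliers come from the stated formula; all names are illustrative.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_experience(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def composite(rs: float, mle: float, swe: float, math: float,
              ss: float, relevance: str) -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with RS scaled by relevance."""
    rs_adjusted = rs * RELEVANCE[relevance]  # assumption: multiplier hits RS first
    te = technical_experience(mle, swe, math)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant: RS=4, MLE=3, SWE=4, Math=2, SS=5, Adjacent experience.
example = composite(rs=4, mle=3, swe=4, math=2, ss=5, relevance="Adjacent")
```

For this hypothetical applicant, RS becomes 4 × 0.85 = 3.4 and TE is 3.1, so the composite is 0.50·3.4 + 0.35·3.1 + 0.15·5 = 3.535.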
Outcome definitions used throughout these analyses:
- is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." It is not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
- is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
- passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

Top incremental signal: Research taste Part 1 (Δ AUC +0.051 [-0.002, +0.104]).
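Composite-only AUC on the Stage-3 empirical pool was 0.662 [0.587, 0.727]. The table below reports the effect of adding each signal to the composite. Nominal Δ values can be small even when individually meaningful, because the composite already aggregates most of the signal. The AUC base column varies from row to row because each comparison is computed only on the rows where the added signal is non-missing.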
| Added signal | n | AUC base | AUC full | Δ AUC | 95% CI |
|---|---|---|---|---|---|
| CodeSignal score | 676 | 0.677 | 0.688 | +0.011 | [-0.020, +0.039] |
| Research taste final | 393 | 0.624 | 0.669 | +0.045 | [-0.005, +0.095] |
| Research taste Part 1 | 394 | 0.625 | 0.676 | +0.051 | [-0.002, +0.104] |
| Research taste Part 2 | 393 | 0.624 | 0.655 | +0.031 | [-0.010, +0.071] |
| ToC alignment | 791 | 0.678 | 0.705 | +0.028 | [-0.006, +0.062] |
| AIS duration | 470 | 0.662 | 0.676 | +0.014 | [-0.014, +0.043] |
| AIS engagement count | 791 | 0.678 | 0.676 | -0.002 | [-0.013, +0.008] |
| AIS bundle (duration + count) | 470 | 0.662 | 0.676 | +0.013 | [-0.013, +0.043] |
⭐ = lower CI bound strictly above 0 (meaningful improvement on this sample). No row in the table carries a ⭐: every 95% CI includes 0.
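One way to obtain a Δ AUC with an interval like those in the table is a paired percentile bootstrap over applicants. The sketch below is an assumption about method, since the resampling scheme actually used is not stated here. It also assumes the predicted scores from the base and full models are already available, and all function names are hypothetical.

```python
import random


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def delta_auc_ci(labels, base_scores, full_scores, n_boot=2000, seed=0):
    """Point estimate and 95% percentile CI for AUC(full) - AUC(base)."""
    rng = random.Random(seed)
    n = len(labels)
    point = auc(labels, full_scores) - auc(labels, base_scores)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample applicants
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # AUC needs both classes in the resample
            base = [base_scores[i] for i in idx]
            full = [full_scores[i] for i in idx]
            deltas.append(auc(y, full) - auc(y, base))
    deltas.sort()
    low = deltas[int(0.025 * n_boot)]
    high = deltas[int(0.975 * n_boot) - 1]
    return point, low, high
```

Resampling the same applicants for both models keeps the comparison paired, which matters because the two AUCs are strongly correlated. Under the legend's rule, a row would earn a ⭐ only when `low > 0`.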
If we drop the composite entirely and use the raw attribute tiers (RS·relevance, MLE, SWE, Math, SS) instead, AUC = 0.709 [0.637, 0.774], vs. composite-only 0.662 [0.587, 0.727] on the same subsample (n = 480). Δ = +0.046 [-0.012, +0.109].
The CI brackets zero, so on this sample the raw-tier model cannot be distinguished from the composite: the data do not show that the composite's fixed weights aggregate the attributes worse than the replacement model does. A3 will examine specific weight choices.
Univariate AUC for CodeSignal score → is_ranked (Stage-3 empirical, n = 676): 0.606 [0.552, 0.661].
The interval lies entirely above 0.5, so CodeSignal selects (predicts ranking), which matches the original 8.0 finding. C1 closes the loop by testing whether the same predictor also tracks mentor-eval scores (the performance side). The expected 8.0 paradox is a positive selection AUC combined with a null performance correlation.
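To make the 0.5 threshold concrete: a univariate AUC is the probability that a randomly chosen ranked applicant has a higher score than a randomly chosen unranked one, with ties counted as half. The sketch below computes that directly on made-up scores. The function names and the toy data are hypothetical, and only the 0.552 lower bound comes from the result above.

```python
def pairwise_auc(ranked_scores, unranked_scores):
    """P(ranked applicant outscores unranked applicant); ties count as 0.5."""
    wins = 0.0
    for r in ranked_scores:
        for u in unranked_scores:
            if r > u:
                wins += 1.0
            elif r == u:
                wins += 0.5
    return wins / (len(ranked_scores) * len(unranked_scores))


def selects(ci_low, chance=0.5):
    """True when the whole interval sits above chance-level discrimination."""
    return ci_low > chance


# Toy scores, not real applicant data.
toy_auc = pairwise_auc([700, 650, 820], [600, 700, 500, 640])  # 10.5 / 12 = 0.875
codesignal_selects = selects(0.552)  # lower CI bound reported above
```

An AUC of 0.5 means the score carries no ordering information about who gets ranked, which is why the comparison is against 0.5 and not 0.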
Sample. Stage-3 empirical pool (n=791, ranked n=147, base rate 18.6%). Same as A1's Stage-3 view. Deduped to one row per person_id. Rows are dropped listwise per signal (the complete-case count is the n column in the results table).
Outcome variable(s). is_ranked (ranked by ≥1 stream).
Predictor fields. Base: composite alone. Added one at a time:
- codesignal_score (derived, max across multi-attempt rollup)
- Research taste test: Final, Part 1, Part 2 scores (lists collapsed to scalar via first element)
- ToC alignment score (0–100)
- AIS engagement: duration (ordinalized: 0 = No experience … 5 = >4 years) and multi-select count (sum across categories)
- Attribute tiers (replacement model): RS × relevance multiplier, MLE, SWE, Math, SS — all read as numeric
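The scalar derivations in the list above can be sketched as follows. This is illustrative only: the helper names and input shapes are hypothetical, and "sum across categories" is assumed to mean the total number of options selected over all multi-select categories. The ordinal duration mapping is omitted because its intermediate levels are not spelled out here.

```python
def codesignal_score(attempts):
    """Best score across attempts; None when no attempt was recorded."""
    valid = [a for a in attempts if a is not None]
    return max(valid) if valid else None


def first_element(value):
    """Collapse a list-valued field to its first element; pass scalars through."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def engagement_count(selected_by_category):
    """Total number of options selected across all multi-select categories."""
    return sum(len(options) for options in selected_by_category.values())


# Hypothetical raw fields for one applicant.
best = codesignal_score([610, None, 702])
taste_part1 = first_element([3.5, 2.0])
ais_count = engagement_count({"courses": ["a", "b"], "events": [], "research": ["c"]})
```

Returning `None` for empty inputs keeps missing values explicit, so the listwise-complete filter can drop those rows later instead of treating them as zeros.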
Filters applied. Stage-3 empirical filter applied (true Stage-3 applications + Empirical selected at Stage 1). Special advances and topped-ups kept (they ARE Stage-3 applicants by construction). Nanda not excluded (pool-level analysis).
Missing-data handling. Listwise-complete per model (rows with any predictor missing are dropped). The complete-case count is reported per row in the n column of the results table. For sparse predictors (e.g., the research-taste test, which only ~400 applicants took), n drops substantially.
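Per-model listwise deletion can be sketched as below. The row layout (one dict per applicant) and the column names other than is_ranked and codesignal_score are hypothetical. The sketch assumes missing values are stored as `None`.

```python
def listwise_complete(rows, predictors, outcome="is_ranked"):
    """Keep only rows where the outcome and every listed predictor are present."""
    needed = list(predictors) + [outcome]
    return [row for row in rows if all(row.get(col) is not None for col in needed)]


# Toy rows, not real applicant data.
rows = [
    {"composite": 3.2, "codesignal_score": 700, "taste_part1": 4.0, "is_ranked": 1},
    {"composite": 2.8, "codesignal_score": None, "taste_part1": 3.0, "is_ranked": 0},
    {"composite": 3.9, "codesignal_score": 650, "taste_part1": None, "is_ranked": 0},
    {"composite": None, "codesignal_score": 640, "taste_part1": 2.5, "is_ranked": 1},
]

# Each model gets its own complete-case sample, so n differs by added signal.
n_codesignal = len(listwise_complete(rows, ["composite", "codesignal_score"]))
n_taste = len(listwise_complete(rows, ["composite", "taste_part1"]))
```

Because the filter is applied separately for each added signal, the base and full models in a given comparison share the same rows, but different comparisons do not.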
Key assumptions / caveats.