A2 — Do additional Stage-2 signals add value over the composite?

Context

The composite score combines five empirical attribute sub-scores. But Stage 2 also collected other signals about each applicant — a CodeSignal coding test, a research-taste test (taken by ~400 Stage-3 empirical applicants), an AI safety engagement score (duration + multi-select), and a Theory-of-Change (ToC) ranking alignment score that measured how the applicant prioritized AI risks. This analysis asks: do any of these add predictive value beyond what the composite already captures, or are they redundant?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
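The formula above can be sketched in a few lines. The weights and multiplier values come from the text; the function names, argument names, and the example sub-score values are hypothetical, and the sketch assumes all sub-scores share one numeric scale.

```python
# Weights and relevance multipliers as stated in the text.
# Function and argument names are illustrative, not the pipeline's own.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def composite(rs: float, mle: float, swe: float, math_score: float,
              ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with RS scaled by relevance."""
    rs_adjusted = rs * RELEVANCE[relevance]  # multiplier applies to RS only
    te = technical_execution(mle, swe, math_score)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant; the sub-score values are made up for illustration.
example = composite(rs=4.0, mle=3.0, swe=4.0, math_score=2.0, ss=5.0,
                    relevance="Adjacent")
```

With these made-up inputs, RS becomes 3.4 after the Adjacent multiplier, TE is 3.1, and the composite is 0.50·3.4 + 0.35·3.1 + 0.15·5.0 = 3.535.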

Outcome definitions

Outcome definitions used throughout these analyses:

Definitions used in this analysis

Headline

Top incremental signal: Research taste Part 1 (Δ AUC +0.051 [-0.002, +0.104]).

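Composite-only AUC on the Stage-3 empirical pool was 0.662 [0.587, 0.727]. The most informative additions are listed below. Because each model is fit on listwise-complete rows, the composite-only baseline ("AUC base") is recomputed on each row's subsample and therefore varies from row to row. A small nominal Δ can still be meaningful for an individual signal, because the composite already aggregates most of the available signal.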

Incremental Δ AUC (composite + signal vs. composite alone)

Added signal                     n     AUC base   AUC full   Δ AUC    95% CI
CodeSignal score                 676   0.677      0.688      +0.011   [-0.020, +0.039]
Research taste final             393   0.624      0.669      +0.045   [-0.005, +0.095]
Research taste Part 1            394   0.625      0.676      +0.051   [-0.002, +0.104]
Research taste Part 2            393   0.624      0.655      +0.031   [-0.010, +0.071]
ToC alignment                    791   0.678      0.705      +0.028   [-0.006, +0.062]
AIS duration                     470   0.662      0.676      +0.014   [-0.014, +0.043]
AIS engagement count             791   0.678      0.676      -0.002   [-0.013, +0.008]
AIS bundle (duration + count)    470   0.662      0.676      +0.013   [-0.013, +0.043]

⭐ = lower CI bound strictly above 0 (meaningful improvement on this sample). No row in this table meets that bar: every interval includes 0.
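As a rough illustration of how a Δ AUC and its interval might be obtained, here is a minimal sketch. It assumes a paired percentile bootstrap over applicants, applied to model scores that are already computed; the source does not state how the intervals were produced or whether models were refit per resample. All names and data are hypothetical.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: P(positive score > negative score), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def delta_auc_ci(base, full, labels, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for AUC(full) - AUC(base).

    The same resampled rows feed both models (paired bootstrap), so the
    interval reflects the difference rather than two separate AUCs.
    """
    rng = random.Random(seed)
    n = len(labels)
    point = auc(full, labels) - auc(base, labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample has only one class; AUC is undefined
        b = [base[i] for i in idx]
        f = [full[i] for i in idx]
        deltas.append(auc(f, y) - auc(b, y))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


def starred(lo):
    """The star rule from the table legend: lower bound strictly above 0."""
    return lo > 0


# Made-up scores for ten applicants (1 = ranked, 0 = not ranked).
labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
base = [0.2, 0.3, 0.5, 0.4, 0.1, 0.6, 0.3, 0.7, 0.8, 0.5]
full = [0.1, 0.2, 0.4, 0.6, 0.1, 0.7, 0.5, 0.3, 0.9, 0.8]
point, lo, hi = delta_auc_ci(base, full, labels, n_boot=500)
```

The pairwise AUC is quadratic in sample size, which is fine for a sketch but slow for thousands of resamples on the full pool; a sort-based AUC would be the usual replacement.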

Attribute tiers vs. composite

If we drop the composite entirely and use the raw attribute tiers (RS·relevance, MLE, SWE, Math, SS) instead, AUC = 0.709 [0.637, 0.774], vs. composite-only 0.662 [0.587, 0.727] on the same subsample (n = 480). Δ = +0.046 [-0.012, +0.109].

The CI brackets zero, so on this sample we cannot distinguish the composite from a linear combination of the same attributes with refitted weights, although the point estimate (+0.046) favors the refitted version. A3 will examine specific weight choices.

CodeSignal paradox marker

Univariate AUC for CodeSignal score → is_ranked (Stage-3 empirical, n = 676): 0.606 [0.552, 0.661].

The interval lies entirely above 0.5, so CodeSignal selects (predicts ranking), matching the original 8.0 finding. C1 closes the loop by testing whether the same predictor also tracks mentor-eval scores (the performance side). The expected 8.0 paradox is a positive selection AUC combined with a null performance correlation.

Takeaways

  1. No single signal substantially improves on the composite. The best individual addition (research-taste Part 1) buys Δ AUC ≈ +0.05 and its confidence interval just barely brackets zero. The composite is doing most of the work the other signals could in principle do.
  2. The composite is roughly as predictive as the raw attribute tiers used together (AUC 0.662 vs. 0.709; Δ = +0.046, with a CI that includes zero). This is consistent with the current 50/35/15 weighting being in a reasonable region, though it does not show the weights are optimal. A3 looks at the optimal weights more directly.
  3. CodeSignal's selection-side AUC replicates the 8.0 cohort's pattern — it does predict ranking, modestly. The 8.0 paradox is that CodeSignal also fails to predict in-program performance (mentor evaluations). Part C will test whether that paradox extends to 10.0 once we have the mentor-eval data.
🔧 Debug: how the data was interpreted

Sample. Stage-3 empirical pool (n=791, ranked n=147, base rate 18.6%). Same as A1's Stage-3 view. Deduped to one row per person_id. Per-signal listwise drop (n_complete is reported as the n column in the results table).

Outcome variable(s). is_ranked (ranked by ≥1 stream).

Predictor fields. Base: composite alone. Added one at a time:
- codesignal_score (derived, max across multi-attempt rollup)
- Research taste test: Final, Part 1, Part 2 scores (lists collapsed to scalar via first element)
- ToC alignment score (0–100)
- AIS engagement: duration (ordinalized: 0 = No experience … 5 = >4 years) and multi-select count (sum across categories)
- Attribute tiers (replacement model): RS × relevance multiplier, MLE, SWE, Math, SS — all read as numeric
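The collapsing rules in the list above might look like the following sketch. Only codesignal_score is a name from the source; the helper names and input shapes are assumptions, as is reading "sum across categories" as the total number of selected options.

```python
from typing import Dict, List, Optional, Sequence


def max_attempt(attempts: Sequence[Optional[float]]) -> Optional[float]:
    """CodeSignal rollup: best score across attempts, None if none recorded."""
    valid = [a for a in attempts if a is not None]
    return max(valid) if valid else None


def first_element(value) -> Optional[float]:
    """Research-taste fields: collapse a list to its first element."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        if not value or value[0] is None:
            return None  # empty list or missing first entry counts as missing
        return float(value[0])
    return float(value)  # already a scalar


def engagement_count(selected: Dict[str, List[str]]) -> int:
    """AIS multi-select: total options selected, summed across categories."""
    return sum(len(options) for options in selected.values())


# Hypothetical raw record, for illustration only.
codesignal_score = max_attempt([510.0, None, 640.0])
taste_part1 = first_element([7.5, 6.0])
ais_count = engagement_count({"courses": ["a", "b"], "research": ["c"]})
```

Returning None for empty inputs keeps missing values visible, so the later listwise drop can remove those rows rather than treating them as zeros.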

Filters applied. Stage-3 empirical filter applied (true Stage-3 applications + Empirical selected at Stage 1). Special advances and topped-ups kept (they ARE Stage-3 applicants by construction). Nanda not excluded (pool-level analysis).

Missing-data handling. Listwise-complete per model (rows with any predictor missing are dropped). n_complete is reported per row in the table as n. For sparse predictors (e.g., the research-taste test, which only ~400 applicants took), n drops substantially.
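A minimal sketch of per-model listwise deletion, showing why n differs between rows of the results table. The field names other than is_ranked, person_id, and codesignal_score are hypothetical, and the rows are invented.

```python
def is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and value != value)


def complete_cases(rows, predictors, outcome="is_ranked"):
    """Keep only rows where every predictor and the outcome are present."""
    needed = list(predictors) + [outcome]
    return [r for r in rows if not any(is_missing(r.get(k)) for k in needed)]


# Invented rows: one per person_id, with gaps in different predictors.
rows = [
    {"person_id": 1, "composite": 3.2, "codesignal_score": 600.0,
     "research_taste_p1": None, "is_ranked": 1},
    {"person_id": 2, "composite": 2.8, "codesignal_score": None,
     "research_taste_p1": 7.0, "is_ranked": 0},
    {"person_id": 3, "composite": 3.9, "codesignal_score": 710.0,
     "research_taste_p1": 8.5, "is_ranked": 1},
    {"person_id": 4, "composite": None, "codesignal_score": 650.0,
     "research_taste_p1": 6.0, "is_ranked": 0},
    {"person_id": 5, "composite": 3.1, "codesignal_score": 580.0,
     "research_taste_p1": float("nan"), "is_ranked": 0},
]

models = {
    "composite + CodeSignal": ["composite", "codesignal_score"],
    "composite + taste Part 1": ["composite", "research_taste_p1"],
}

# Each model gets its own complete-case subsample, so n varies by row.
n_complete = {name: len(complete_cases(rows, preds))
              for name, preds in models.items()}
```

Because the baseline composite-only model would be evaluated on the same filtered rows as each augmented model, the comparison stays paired even though the subsamples differ across signals.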

Key assumptions / caveats.