C1 — Does the CodeSignal paradox replicate across cohorts?

Context

The 'CodeSignal paradox' was the most-cited finding from the 8.0 cohort: applicants' CodeSignal scores predicted who got accepted to MATS reasonably well (selection-side AUC ≈ 0.77), but did NOT predict mentor evaluations of their in-program performance (correlation ≈ 0). Implication: CodeSignal selects for something MATS evaluators value at the application stage, but that something doesn't translate to actual research-engineering output during the fellowship.

Does the paradox replicate? We test it across cohorts 7.0, 8.0, and 9.0 (the cohorts that have both CodeSignal AND mentor evaluations). 10.0's mentor-eval data doesn't exist yet (program just started).

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
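To make the weighting concrete, here is a minimal sketch of the empirical-track composite. It assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied, which is one reading of the description above. The class name, field names, and the example scores are hypothetical and do not come from the actual pipeline.

```python
from dataclasses import dataclass

# Relevance multipliers quoted in the text; applied to Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


@dataclass
class EmpiricalScores:
    # Hypothetical container; field names are illustrative.
    research_skills: float  # RS
    mle: float
    swe: float
    math: float
    soft_skills: float  # SS
    relevance: str = "Direct"


def technical_execution(s: EmpiricalScores) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math


def composite(s: EmpiricalScores) -> float:
    """Composite = 0.50*RS' + 0.35*TE + 0.15*SS.

    Assumption: RS' = multiplier * RS, i.e. the multiplier is applied
    to Research Skills before weighting.
    """
    rs_adjusted = RELEVANCE[s.relevance] * s.research_skills
    return (0.50 * rs_adjusted
            + 0.35 * technical_execution(s)
            + 0.15 * s.soft_skills)


# Made-up scores for an applicant with "Adjacent" experience.
example = EmpiricalScores(research_skills=8, mle=6, swe=7, math=5,
                          soft_skills=9, relevance="Adjacent")
print(round(technical_execution(example), 3))  # 6.1
print(round(composite(example), 3))            # 6.885
```

In this example the multiplier lowers Research Skills from 8 to 6.8, which costs 0.6 composite points relative to a "Direct" match.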

Outcome definitions

Outcome definitions used throughout these analyses:

CodeSignal test differences across cohorts

7.0–9.0 used the same CodeSignal Industry Coding Assessment (ICA). 10.0 used a custom variant ('MATS Chatbot Service'). For C1, all three compared cohorts used the same test, so raw scores are comparable.

Mentor eval dimensions

Each mentor eval scored fellows on four dimensions: domain skill, research execution, AI safety knowledge, and mission alignment.

7.0 and 8.0 computed a composite of these (mean 6.98–7.10 across cohorts); 9.0 didn't compute a composite, so we synthesize one as the mean of the 4 dimensions.
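A minimal sketch of the synthesized 9.0 composite, assuming each evaluation is a dict keyed by dimension. The key names are hypothetical, and returning no composite when any dimension is missing is an assumption made to stay consistent with the listwise-drop rule in the debug notes.

```python
from statistics import mean
from typing import Optional

# Hypothetical key names for the four mentor-eval dimensions.
DIMENSIONS = (
    "domain_skill",
    "research_execution",
    "ai_safety_knowledge",
    "mission_alignment",
)


def synthesize_composite(evaluation: dict) -> Optional[float]:
    """Unweighted mean of the four dimensions, or None if any is missing."""
    values = [evaluation.get(d) for d in DIMENSIONS]
    if any(v is None for v in values):
        return None  # assumption: incomplete evaluations get no composite
    return float(mean(values))


print(synthesize_composite({
    "domain_skill": 7,
    "research_execution": 8,
    "ai_safety_knowledge": 6,
    "mission_alignment": 7,
}))  # 7.0
```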

Headline

Yes — both halves of the paradox show up in 7.0, 8.0, and 9.0.

In other words, CodeSignal selects for something that MATS evaluators and reviewers value enough to admit applicants, but that something does not predict whether mentors judge the fellow to have performed well during the program.

Selection-side AUC

Cohort  n    Passed bar  AUC    95% CI
7.0     499  60          0.709  [0.634, 0.778]
8.0     894  72          0.703  [0.634, 0.762]
9.0     924  117         0.783  [0.743, 0.821]
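For readers who want to reproduce numbers of this kind, the sketch below computes a selection-side AUC from CodeSignal scores and a 0/1 passed-bar flag, plus a percentile bootstrap interval. The source does not state how the 95% CIs were obtained, so the bootstrap is an assumption, and all function names are illustrative.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def auc(scores, passed):
    """Probability that a random admit outscores a random non-admit.

    Computed from the Mann-Whitney U statistic; ties count as 0.5.
    """
    n_pos = sum(1 for p in passed if p)
    n_neg = len(passed) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one admit and one non-admit")
    ranks = average_ranks(scores)
    rank_sum_pos = sum(r for r, p in zip(ranks, passed) if p)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def auc_with_bootstrap_ci(scores, passed, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval (an assumption)."""
    point = auc(scores, passed)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [passed[i] for i in idx]
        n_pos = sum(1 for p in labels if p)
        if 0 < n_pos < n:  # skip resamples that contain a single class
            stats.append(auc([scores[i] for i in idx], labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper
```

The same point estimate is available from `sklearn.metrics.roc_auc_score` if scikit-learn is installed; the stdlib version is used here only to keep the sketch self-contained.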

Performance-side Spearman ρ

Cohort  n (in mentor eval)  Spearman ρ (CodeSignal vs mentor composite)
7.0     57                  -0.106
8.0     70                  -0.184
9.0     71                  +0.022
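The performance-side statistic is a rank correlation, so it depends only on the ordering of fellows, not on the raw score scales. A minimal stdlib sketch follows, with ties given average ranks. Function names are illustrative, and the inputs in the example are made up.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def spearman_rho(codesignal, mentor_composite):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(codesignal) != len(mentor_composite):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(codesignal), average_ranks(mentor_composite))


# Made-up scores for five fellows; two adjacent pairs are swapped in rank.
print(round(spearman_rho([600, 650, 700, 750, 800],
                         [7.2, 6.9, 8.1, 7.8, 8.5]), 3))  # 0.8
```

`scipy.stats.spearmanr` computes the same statistic and also returns a p-value, which is useful at sample sizes this small.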

Per-dimension correlations

Does CodeSignal correlate with any individual mentor-eval dimension, even if not with the composite?

Cohort  Domain skill  Research execution  AI safety knowledge  Mission alignment
7.0     -0.06         -0.09               -0.18                -0.29
8.0     -0.12         -0.08               -0.33                -0.33
9.0     +0.08         -0.04               +0.05                -0.14

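Even at the per-dimension level, no dimension shows a positive correlation in all three cohorts. The largest magnitudes are negative: mission alignment ranges from -0.14 to -0.33, and AI safety knowledge reaches -0.33 in 8.0 but flips to +0.05 in 9.0. There's no single dimension where a higher CodeSignal score reliably predicts a higher mentor-eval score.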

Takeaways

  1. The 8.0 paradox is not an 8.0-specific artifact. It replicates across 7.0 and 9.0 with the same shape.
  2. CodeSignal does predict selection with meaningful AUC. Either reviewers actively look at it (likely — it was a Stage-2 input in 7.0–9.0), or it correlates with other signals reviewers use.
  3. CodeSignal does NOT predict mentor evaluations of fellows during the program. After three cohorts of evidence, this is a robust finding.
  4. For 11.0: this is substantial evidence against using CodeSignal as a Stage-2 score input. The current proposal to drop or de-emphasize it in 11.0 is supported. If a coding test is kept at all, it should be justified by something other than its ability to predict mentor evaluations.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Per cohort: completed applicants joined to mentor-eval rows via person_id. Multi-evaluated fellows averaged. 7.0: n_completers=878, mentor n=76. 8.0: n_completers=1,454, mentor n=106. 9.0: n_completers=1,296, mentor n=93.

Outcome variable(s). Selection-side: passed_mentors_bar (accepted-to-MATS proxy for 6.0/7.0/8.0; true offer data for 9.0). Performance-side: mentor evaluation composite score (7.0 and 8.0) or mean of standardized dimensions (9.0; no composite computed).

Predictor fields. CodeSignal score as raw numeric ('cheated' → 0). 7.0/8.0 use the 'CodeSignal score' field; 9.0 uses the 'CodeSignal score numerical' field.

Filters applied. Completed applications only. Inner join applicant ↔ mentor on person_id.

Missing-data handling. Listwise drop on each correlation/AUC.
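The preparation steps above (coding 'cheated' as 0, keeping completed applications, inner-joining on person_id, averaging fellows with multiple evaluations, and dropping incomplete rows) can be sketched as follows. Only person_id and the 'cheated' rule come from the notes; the other field names (`completed`, `codesignal_raw`, `composite`) and the row layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def parse_codesignal(raw):
    """Numeric score; 'cheated' maps to 0; missing or unparseable gives None."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.lower() == "cheated":
        return 0.0
    try:
        return float(text)
    except ValueError:
        return None


def build_pairs(applicants, mentor_evals):
    """Return (person_id, codesignal, mentor_composite) for analyzable fellows."""
    # Collect every evaluation per fellow so multiple evaluations can be averaged.
    evals_by_person = defaultdict(list)
    for row in mentor_evals:
        if row.get("composite") is not None:
            evals_by_person[row["person_id"]].append(row["composite"])

    pairs = []
    for applicant in applicants:
        if not applicant.get("completed"):
            continue  # completed applications only
        score = parse_codesignal(applicant.get("codesignal_raw"))
        evals = evals_by_person.get(applicant["person_id"])
        if score is None or not evals:
            continue  # listwise drop, and inner join on person_id
        pairs.append((applicant["person_id"], score, mean(evals)))
    return pairs


applicants = [
    {"person_id": "a", "completed": True, "codesignal_raw": "700"},
    {"person_id": "b", "completed": True, "codesignal_raw": "cheated"},
    {"person_id": "c", "completed": False, "codesignal_raw": "800"},
    {"person_id": "d", "completed": True, "codesignal_raw": None},
    {"person_id": "e", "completed": True, "codesignal_raw": "650"},
]
mentor_evals = [
    {"person_id": "a", "composite": 7.0},
    {"person_id": "a", "composite": 8.0},
    {"person_id": "b", "composite": 6.0},
    {"person_id": "c", "composite": 9.0},
    {"person_id": "d", "composite": 5.0},
]
print(build_pairs(applicants, mentor_evals))
# [('a', 700.0, 7.5), ('b', 0.0, 6.0)]
```

Because the join is inner and the drop is listwise, the analyzable n can be well below the mentor n reported in the sample note, which is consistent with the smaller n values in the performance-side table.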

Key assumptions / caveats.