A8 — Does our composite predict an external work-test score?

Context

Anthropic and OpenAI jointly run a Stage-3 stream called Megastream, which uses a 5-hour research-engineering takehome test to evaluate applicants. The takehome was designed independently of MATS's rubric. Because it's an external criterion — produced by a process not derived from ours — it's a rare opportunity to ask: does our composite score correlate with someone else's measure of research-engineering ability?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
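
As a minimal sketch, the formula above can be written out as follows. The `RubricScores` container and the example values are hypothetical, and the code assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied, which is how the sentence above reads.

```python
from dataclasses import dataclass

# Multipliers and weights copied from the formula above.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


@dataclass
class RubricScores:
    """Hypothetical container for one applicant's rubric scores."""
    rs: float        # Research Skills
    mle: float       # Technical Execution: MLE sub-score
    swe: float       # Technical Execution: SWE sub-score
    math: float      # Technical Execution: Math sub-score
    ss: float        # Soft Skills
    relevance: str   # "Direct", "Adjacent", or "Distant"


def technical_execution(s: RubricScores) -> float:
    # TE = 0.50*MLE + 0.30*SWE + 0.20*Math
    return 0.50 * s.mle + 0.30 * s.swe + 0.20 * s.math


def empirical_composite(s: RubricScores) -> float:
    # Assumption: the multiplier is applied to RS before weighting.
    rs_adjusted = s.rs * RELEVANCE[s.relevance]
    return 0.50 * rs_adjusted + 0.35 * technical_execution(s) + 0.15 * s.ss
```

Because the weights sum to 1.0, an applicant with identical scores on every attribute and Direct relevance gets that same score back as the composite; Adjacent or Distant relevance lowers only the Research Skills term.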

Outcome definitions

Outcome definitions used throughout these analyses:

Context on Megastream and AFP

Megastream reviewed all ~600 Stage-3 empirical applicants and sent the takehome to ~80–112 of them; 87 completed it. AFP (Anthropic Fellows Program) is a parallel program Anthropic runs; some MATS applicants were also considered via AFP. AFP-only applicants ('Considering via AFP' status) did not take the MATS Megastream takehome and are excluded from this analysis.

A takehome score of 0 means the applicant was invited but didn't complete the test; these applicants are also excluded.

Headline

Of 87 Megastream/AFP takehome takers, TE (SWE) has the strongest positive Spearman correlation with takehome score (ρ = +0.152, 95% CI [-0.065, +0.363]); TE (Math) is larger in magnitude but negative (ρ = -0.169 [-0.371, +0.045]). Every interval includes zero. The 10.0 composite specifically: ρ = +0.046 [-0.174, +0.263]; CodeSignal: ρ = +0.147 [-0.081, +0.364].

Spearman correlations with takehome score

Predictor             n    Spearman ρ   95% CI
Empirical composite   87   +0.046       [-0.174, +0.263]
CodeSignal score      87   +0.147       [-0.081, +0.364]
RS · relevance        43   -0.072       [-0.341, +0.237]
TE (MLE)              87   +0.051       [-0.146, +0.254]
TE (SWE)              87   +0.152       [-0.065, +0.363]
TE (Math)             87   -0.169       [-0.371, +0.045]
Soft skills           87   -0.089       [-0.285, +0.116]
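
For readers who want to reproduce numbers of this kind, here is a rough stdlib-only sketch of a Spearman correlation with a confidence interval. The report does not say how its intervals were computed; the percentile bootstrap below is an assumption chosen for illustration, and all function names are hypothetical.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method, not the report's)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # skip NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Calling `spearman(predictor, takehome)` and `bootstrap_ci(predictor, takehome)` on the paired, complete-case values for one predictor would give one row of a table like the one above; a different interval method would shift the bounds somewhat.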

AUC for predicting above-median takehome score

Predictor             n    AUC     95% CI
Empirical composite   87   0.438   [0.317, 0.564]
CodeSignal score      87   0.604   [0.481, 0.726]
RS · relevance        43   0.419   [0.262, 0.605]
TE (MLE)              87   0.564   [0.457, 0.677]
TE (SWE)              87   0.580   [0.463, 0.695]
TE (Math)             87   0.337   [0.233, 0.456]
Soft skills           87   0.391   [0.279, 0.498]
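
The AUC here can be read as the probability that a randomly chosen above-median taker has a higher predictor score than a randomly chosen at-or-below-median taker. A small sketch of that calculation follows; it assumes "above median" is a strict inequality, which the report does not state, and the function names are hypothetical.

```python
from statistics import median


def above_median_labels(outcome):
    # Assumption: strictly above the median counts as positive,
    # so scores equal to the median are treated as negatives.
    m = median(outcome)
    return [v > m for v in outcome]


def auc(scores, labels):
    """P(random positive outranks random negative); ties count as 0.5."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 means the predictor orders the two groups no better than chance, and values below 0.5 mean higher predictor scores go with below-median takehome scores.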

Composite vs. takehome (scatter)

Status breakdown

ms_status          n    th_mean   th_median   composite_mean   ranked_p
-> Offer           6    31.83     32.00       2.87             1.00
Reject             1    26.00     26.00       2.95             0.00
Reject (post TH)   80   22.18     22.25       2.72             0.29

Takeaways

  1. The 10.0 composite shows essentially no correlation with the Megastream takehome score — point estimate +0.05, CI brackets zero. This is the cleanest external-validity check we have, and it comes back null.
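  2. CodeSignal does slightly better (ρ = +0.15), modest but in the right direction. Its CI [-0.081, +0.364] also includes zero, so at this n the gap over the composite is not statistically distinguishable. Consistent with the 8.0 paradox: CodeSignal captures something narrowly technical that other engineering tests also capture.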
  3. Caveat — small n and severe range restriction. All 87 takehome takers cleared multiple selection bars to get there; variance in their composite scores is squeezed. Population-level correlations would likely be higher.
  4. Read this in conjunction with Part C (mentor-eval validation). The pattern of "composite predicts selection but not external performance" is what 8.0 showed for CodeSignal specifically. If the same shape shows up in mentor evaluations in Part C, that's a much stronger signal than this single under-powered analysis.
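
The range-restriction caveat in takeaway 3 can be illustrated with a toy simulation. Everything here is an assumption for illustration: scores are drawn as standard normals with a made-up true correlation of 0.5, and selection is a hard cut keeping the top 600 of 2,200 on the predictor, which is cruder than the actual multi-stage process.

```python
import random
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(rho=0.5, n=2200, keep=600, seed=0):
    """Return (full-sample r, r among the top `keep` on the predictor)."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        a, e = rng.gauss(0, 1), rng.gauss(0, 1)
        x.append(a)
        # y has correlation rho with x by construction (hypothetical value).
        y.append(rho * a + (1 - rho ** 2) ** 0.5 * e)
    full = pearson(x, y)
    top = sorted(range(n), key=lambda i: x[i], reverse=True)[:keep]
    restricted = pearson([x[i] for i in top], [y[i] for i in top])
    return full, restricted


full_r, restricted_r = simulate()
print(f"full sample r = {full_r:.2f}, selected subsample r = {restricted_r:.2f}")
```

Under these assumptions the correlation in the selected subsample comes out lower than in the full sample. The sketch shows the direction of the bias only; it does not estimate what the composite's unrestricted correlation with the takehome would be.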

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Megastream/AFP takehome takers: applicants whose Takehome Score is non-null AND not 0.0 (0.0 = invited but didn't complete). n = 87. AFP-only applicants ('Considering via AFP') are excluded since they didn't take a MATS takehome.

Outcome variable(s). Takehome Score (from [s] AFP -> MATS Decisions) — list cell, collapsed to scalar via first element, coerced numeric. Continuous score; also binarized at median for AUC analysis.
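
A minimal sketch of the parsing described above, combined with the taker definition from the Sample paragraph. The function names are hypothetical, and treating unparseable or NaN values as missing is an assumption about how the numeric coercion behaves.

```python
from typing import Any, Optional


def takehome_scalar(cell: Any) -> Optional[float]:
    """Collapse a list cell to its first element and coerce to float."""
    if isinstance(cell, (list, tuple)):
        if not cell:
            return None  # empty list: no score recorded
        cell = cell[0]
    if cell is None:
        return None
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return None  # assumption: unparseable values count as missing
    return None if value != value else value  # NaN counts as missing


def is_taker(cell: Any) -> bool:
    """Taker = Takehome Score is non-null and not 0.0."""
    score = takehome_scalar(cell)
    return score is not None and score != 0.0
```

For example, `takehome_scalar(["31"])` returns `31.0`, while `is_taker([0.0])` and `is_taker(None)` are both `False`.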

Predictor fields. 10.0 composite + attribute components (RS·rel, MLE, SWE, Math, SS) + CodeSignal score. All numeric; same parsing as A1–A3.

Filters applied. Canonical 10.0 sample (deduped). Subsample = TH takers as defined above. Per-predictor listwise drop.

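Missing-data handling. Per-predictor listwise drop: for each predictor, rows missing that predictor or the takehome score are dropped, so n can differ between rows (e.g. n = 43 for RS · relevance vs. 87 for the others). Reported as n per row.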
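
As a sketch of what a per-predictor drop looks like in practice, the helper below keeps only rows where both the predictor and the outcome are present. The row layout and the field names (`takehome`, `composite`, `rs_rel`) are hypothetical, and the example values are made up.

```python
def complete_pairs(rows, predictor, outcome="takehome"):
    """Return (x, y) lists for rows where predictor and outcome both exist."""
    x, y = [], []
    for row in rows:
        a, b = row.get(predictor), row.get(outcome)
        if a is None or b is None:
            continue  # drop this row for this predictor only
        x.append(a)
        y.append(b)
    return x, y


# Made-up rows: the second and third have no RS-relevance value.
rows = [
    {"takehome": 22.0, "composite": 2.7, "rs_rel": 3.1},
    {"takehome": 31.0, "composite": 2.9, "rs_rel": None},
    {"takehome": 26.0, "composite": 3.0},
]

for name in ("composite", "rs_rel"):
    x, _ = complete_pairs(rows, name)
    print(name, "n =", len(x))
```

Because the drop is applied separately for each predictor, a row missing one predictor still contributes to every other predictor's estimate, which is why n varies across rows of the tables.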

Key assumptions / caveats.