Anthropic and OpenAI jointly run a Stage-3 stream called Megastream, which uses a 5-hour research-engineering takehome test to evaluate applicants. The takehome was designed independently of MATS's rubric. Because it's an external criterion — produced by a process not derived from ours — it's a rare opportunity to ask: does our composite score correlate with someone else's measure of research-engineering ability?
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
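The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function and argument names are hypothetical, the example inputs are made up, and it assumes the relevance multiplier scales RS before the 0.50 weight is applied.

```python
# Hypothetical sketch of the 10.0 empirical composite described above.
# Names are illustrative; only the weights and multipliers come from the text.

RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_experience(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs: float, mle: float, swe: float, math_score: float,
                        ss: float, relevance: str) -> float:
    """Composite = 0.50*(RS * relevance) + 0.35*TE + 0.15*SS.

    Assumption: the multiplier is applied to RS before weighting.
    """
    rs_adjusted = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_experience(mle, swe, math_score)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


if __name__ == "__main__":
    # Made-up attribute scores for one applicant with "Adjacent" experience.
    print(round(empirical_composite(4, 3, 4, 2, 3, "Adjacent"), 3))  # 3.235
```

With these made-up inputs, RS·relevance is 3.4 and TE is 3.1, so the composite is 0.50·3.4 + 0.35·3.1 + 0.15·3 = 3.235.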
Outcome definitions used throughout these analyses:
is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.

is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.

passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

Megastream reviewed all ~600 Stage-3 empirical applicants and sent the takehome to ~80–112 of them; ~87 completed it. AFP (Anthropic Fellows Program) is a parallel program Anthropic runs; some MATS applicants were also considered via AFP. AFP-only applicants ('Considering via AFP' status) did not take the MATS Megastream takehome and are excluded from this analysis.
A takehome score of 0 means the applicant was invited but did not complete the test; these applicants are also excluded.
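Among the 87 Megastream/AFP takehome takers, TE (SWE) has the strongest positive Spearman correlation with takehome score (ρ = +0.152, 95% CI [-0.065, +0.363]), closely followed by CodeSignal (ρ = +0.147 [-0.081, +0.364]). The 10.0 composite is near zero (ρ = +0.046 [-0.174, +0.263]). TE (Math) has the largest correlation in magnitude, but it is negative (ρ = -0.169 [-0.371, +0.045]). Every 95% CI includes zero.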
| Predictor | n | Spearman ρ | 95% CI |
|---|---|---|---|
| Empirical composite | 87 | +0.046 | [-0.174, +0.263] |
| CodeSignal score | 87 | +0.147 | [-0.081, +0.364] |
| RS · relevance | 43 | -0.072 | [-0.341, +0.237] |
| TE (MLE) | 87 | +0.051 | [-0.146, +0.254] |
| TE (SWE) | 87 | +0.152 | [-0.065, +0.363] |
| TE (Math) | 87 | -0.169 | [-0.371, +0.045] |
| Soft skills | 87 | -0.089 | [-0.285, +0.116] |

AUC for takehome score binarized at the median:

| Predictor | n | AUC | 95% CI |
|---|---|---|---|
| Empirical composite | 87 | 0.438 | [0.317, 0.564] |
| CodeSignal score | 87 | 0.604 | [0.481, 0.726] |
| RS · relevance | 43 | 0.419 | [0.262, 0.605] |
| TE (MLE) | 87 | 0.564 | [0.457, 0.677] |
| TE (SWE) | 87 | 0.580 | [0.463, 0.695] |
| TE (Math) | 87 | 0.337 | [0.233, 0.456] |
| Soft skills | 87 | 0.391 | [0.279, 0.498] |
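
One way to read the AUC column: it is the probability that a randomly chosen applicant from the positive class has a higher predictor value than a randomly chosen applicant from the negative class, counting ties as one half. The sketch below computes it by direct pairwise comparison. It is an illustration under assumptions, not the analysis code: the function name is hypothetical, and the label coding (1 = positive class) is assumed.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    Assumption: labels are coded 1 for the positive class, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
    print(pairwise_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Under this definition, 0.5 means no discrimination, and a value below 0.5 means higher predictor values go with the negative class.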

Takehome score, composite, and share ranked, by Megastream status:

| Megastream status | n | Takehome mean | Takehome median | Composite mean | Share ranked |
|---|---|---|---|---|---|
| -> Offer | 6 | 31.83 | 32.00 | 2.87 | 1.00 |
| Reject | 1 | 26.00 | 26.00 | 2.95 | 0.00 |
| Reject (post TH) | 80 | 22.18 | 22.25 | 2.72 | 0.29 |
Sample. Megastream/AFP takehome takers: applicants whose Takehome Score is non-null and not 0.0 (0.0 = invited but didn't complete). n = 87. AFP-only applicants ('Considering via AFP') are excluded since they didn't take a MATS takehome.
Outcome variable(s). Takehome Score (from [s] AFP -> MATS Decisions) — list cell, collapsed to scalar via first element, coerced numeric. Continuous score; also binarized at median for AUC analysis.
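The parsing and binarization steps described above might look like the following sketch. The helper names are hypothetical, and one detail is an assumption the text does not settle: scores exactly equal to the median are placed in the lower group here.

```python
import math
import statistics
from typing import Any, List, Optional, Sequence


def parse_takehome(cell: Any) -> Optional[float]:
    """Collapse a list cell to its first element and coerce to float.

    Returns None for empty, missing, or unparseable cells.
    """
    if isinstance(cell, (list, tuple)):
        cell = cell[0] if cell else None
    if cell is None:
        return None
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def takehome_takers(cells: Sequence[Any]) -> List[float]:
    """Keep scores that are non-null and not 0.0 (0.0 = invited, not completed)."""
    parsed = (parse_takehome(c) for c in cells)
    return [s for s in parsed if s is not None and s != 0.0]


def binarize_at_median(scores: Sequence[float]) -> List[int]:
    """1 if strictly above the median, else 0 (tie handling is an assumption)."""
    median = statistics.median(scores)
    return [1 if s > median else 0 for s in scores]


if __name__ == "__main__":
    raw = [[22.5], [0.0], None, ["31"], [], "n/a", [26]]
    scores = takehome_takers(raw)
    print(scores)                      # [22.5, 31.0, 26.0]
    print(binarize_at_median(scores))  # [0, 1, 0]
```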
Predictor fields. 10.0 composite + attribute components (RS·rel, MLE, SWE, Math, SS) + CodeSignal score. All numeric; same parsing as A1–A3.
Filters applied. Canonical 10.0 sample (deduped). Subsample = TH takers as defined above. Per-predictor listwise drop.
Missing-data handling. Per-predictor listwise drop. Reported as n per row.
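A per-predictor listwise drop followed by a Spearman correlation can be sketched as below, using only the standard library. The function names are hypothetical. The percentile bootstrap used for the interval is an assumption made for illustration; this document does not state how its 95% CIs were computed.

```python
import random
from typing import List, Optional, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / (sxx * syy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_with_ci(predictor: Sequence[Optional[float]],
                     outcome: Sequence[Optional[float]],
                     n_boot: int = 2000,
                     seed: int = 0) -> Tuple[int, float, Tuple[float, float]]:
    """Drop rows missing this predictor or the outcome, then bootstrap rho."""
    pairs = [(p, o) for p, o in zip(predictor, outcome)
             if p is not None and o is not None]
    xs, ys = zip(*pairs)
    rho = spearman(xs, ys)

    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        bx, by = zip(*sample)
        r = spearman(bx, by)
        if r == r:  # skip NaN from degenerate resamples
            boots.append(r)
    boots.sort()
    lo = boots[int(0.025 * len(boots))]
    hi = boots[min(len(boots) - 1, int(0.975 * len(boots)))]
    return len(pairs), rho, (lo, hi)
```

Because rows are dropped separately for each predictor, n can differ by row, which is why the tables report n per predictor.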
Key assumptions / caveats.