A8 — Megastream takehome validation

Context

Anthropic and OpenAI jointly run a Stage-3 stream called Megastream, which uses a 5-hour research-engineering takehome test to evaluate applicants. The takehome was designed independently of MATS's rubric. Because it's an external criterion — produced by a process not derived from ours — it's a rare opportunity to ask: does our composite score correlate with someone else's measure of research-engineering ability?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
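
As a worked illustration of the formula above, the sketch below computes the empirical composite for one applicant. The function name, argument names, and example sub-scores are hypothetical; only the weights and multiplier values come from the text. It assumes the relevance multiplier scales RS before the 0.50 weight is applied.

```python
# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Hypothetical helper: 0.50*RS*multiplier + 0.35*TE + 0.15*SS."""
    # Technical Execution is itself a weighted blend of three sub-scores.
    te = 0.50 * mle + 0.30 * swe + 0.20 * math_score
    # Assumption: the multiplier is applied to Research Skills only.
    rs_adjusted = rs * RELEVANCE[relevance]
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Made-up sub-scores, for illustration only.
direct = empirical_composite(3.0, 3.0, 3.0, 3.0, 3.0, "Direct")
distant = empirical_composite(3.0, 3.0, 3.0, 3.0, 3.0, "Distant")
print(round(direct, 3), round(distant, 3))
```

With identical sub-scores, moving from Direct to Distant relevance lowers only the Research Skills term, so the composite drops by 0.50 · RS · 0.40.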

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
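
The three outcome definitions above can be sketched as boolean derivations over per-applicant flags. The `StreamEngagement` record and its field names are hypothetical (the actual data schema is not given here); the logic follows the definitions as written.

```python
from dataclasses import dataclass


@dataclass
class StreamEngagement:
    """Hypothetical per-applicant flags, aggregated over all streams."""
    worktest_invited: bool = False
    interview_invited: bool = False
    ranked: bool = False
    sent_megastream_takehome: bool = False
    offered: bool = False
    waitlisted: bool = False


def outcomes(e):
    """Derive the three outcome flags from the engagement record."""
    is_ranked = e.ranked
    # Any form of engagement counts, so every ranked applicant is included.
    is_invited_to_worktest = (
        e.worktest_invited
        or e.interview_invited
        or e.ranked
        or e.sent_megastream_takehome
    )
    passed_mentors_bar = e.offered or e.waitlisted
    return {
        "is_ranked": is_ranked,
        "is_invited_to_worktest": is_invited_to_worktest,
        "passed_mentors_bar": passed_mentors_bar,
    }


# An applicant who was only sent the takehome is engaged but not ranked.
print(outcomes(StreamEngagement(sent_megastream_takehome=True)))
```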

Context on Megastream and AFP

Megastream reviewed all ~600 Stage-3 empirical applicants and sent the takehome to ~80–112 of them; ~87 completed it. AFP (Anthropic Fellows Program) is a parallel program Anthropic runs; some MATS applicants were also considered via AFP. AFP-only applicants ('Considering via AFP' status) did not take the MATS Megastream takehome and are excluded from this analysis.

A takehome score of 0 means the applicant was invited but did not complete the test; these applicants are also excluded.
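
A minimal sketch of the two exclusion rules, assuming each applicant is a dict with hypothetical `status` and `takehome_score` keys. Treating a missing score as "never sent the takehome" is also an assumption.

```python
def analysis_sample(rows):
    """Keep only applicants who completed the Megastream takehome."""
    kept = []
    for row in rows:
        if row.get("status") == "Considering via AFP":
            continue  # AFP-only: did not take the MATS Megastream takehome
        score = row.get("takehome_score")
        if score is None:
            continue  # assumption: no score recorded means never invited
        if score == 0:
            continue  # invited but did not complete
        kept.append(row)
    return kept


# Made-up rows, for illustration only.
rows = [
    {"id": 1, "status": "Reject (post TH)", "takehome_score": 22.5},
    {"id": 2, "status": "Considering via AFP", "takehome_score": 30.0},
    {"id": 3, "status": "Reject", "takehome_score": 0},
    {"id": 4, "status": "Reject", "takehome_score": None},
]
print([r["id"] for r in analysis_sample(rows)])
```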

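Of 87 Megastream/AFP takehome takers, TE (SWE) has the strongest positive Spearman correlation with takehome score (ρ = +0.152, 95% CI [-0.065, +0.363]). TE (Math) has the largest correlation in absolute value, and it is negative (ρ = -0.169 [-0.371, +0.045]). The 10.0 composite specifically: ρ = +0.046 [-0.174, +0.263]; CodeSignal: ρ = +0.147 [-0.081, +0.364]. Every interval includes zero, so none of these correlations is distinguishable from no association at this sample size.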

Spearman correlations with takehome score

Predictor	n	Spearman ρ	95% CI
Empirical composite	87	+0.046	[-0.174, +0.263]
CodeSignal score	87	+0.147	[-0.081, +0.364]
RS · relevance	43	-0.072	[-0.341, +0.237]
TE (MLE)	87	+0.051	[-0.146, +0.254]
TE (SWE)	87	+0.152	[-0.065, +0.363]
TE (Math)	87	-0.169	[-0.371, +0.045]
Soft skills	87	-0.089	[-0.285, +0.116]
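
For readers who want to reproduce numbers of this kind, here is a standard-library sketch of Spearman's ρ with a percentile-bootstrap interval. How the reported intervals were actually computed is not stated here, so the bootstrap, the resample count, and the seed are all assumptions; the data in the example is made up.

```python
import random


def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over applicants (resampling pairs)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # skip degenerate resamples where rho is NaN
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Made-up predictor and takehome scores, for illustration only.
predictor = [1, 2, 3, 4, 5]
takehome = [2, 1, 4, 3, 5]
rho = spearman(predictor, takehome)
lo, hi = bootstrap_ci(predictor, takehome)
print(round(rho, 3), lo <= hi)
```

Resampling whole (predictor, takehome) pairs keeps each applicant's two scores together, which is what makes the interval reflect sampling variation in the association itself.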

AUC for predicting above-median takehome score

Predictor	n	AUC	95% CI
Empirical composite	87	0.438	[0.317, 0.564]
CodeSignal score	87	0.604	[0.481, 0.726]
RS · relevance	43	0.419	[0.262, 0.605]
TE (MLE)	87	0.564	[0.457, 0.677]
TE (SWE)	87	0.580	[0.463, 0.695]
TE (Math)	87	0.337	[0.233, 0.456]
Soft skills	87	0.391	[0.279, 0.498]
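
The AUC here can be read as the probability that a randomly chosen above-median takehome taker has a higher predictor score than a randomly chosen taker at or below the median. The sketch below computes it by direct pairwise comparison. Whether "above median" is strict and how ties were handled in the reported numbers is not stated, so both choices are assumptions, as are the function names and the made-up data.

```python
from statistics import median


def above_median_labels(takehome):
    """Assumption: strictly greater than the sample median counts as positive."""
    cut = median(takehome)
    return [t > cut for t in takehome]


def auc(scores, labels):
    """P(score of a random positive > score of a random negative), ties = 0.5."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores, for illustration only.
takehome = [10, 20, 30, 40]
predictor = [1, 3, 2, 4]
labels = above_median_labels(takehome)
forward = auc(predictor, labels)
reversed_auc = auc([-p for p in predictor], labels)
print(forward, reversed_auc)
```

An AUC below 0.5, as for TE (Math) in the table, means higher predictor scores tend to go with below-median takehome scores; negating the predictor would give 1 minus that AUC.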

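Status breakdown (ms_status = Megastream status; th = takehome score; ranked_p = share of the group with is_ranked)

ms_status	n	th_mean	th_median	composite_mean	ranked_p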
-> Offer	6	31.83	32.00	2.87	1.00
Reject	1	26.00	26.00	2.95	0.00
Reject (post TH)	80	22.18	22.25	2.72	0.29
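
A breakdown like the one above can be produced by grouping takers on Megastream status. In this sketch the input keys (`ms_status`, `takehome`, `composite`, `is_ranked`) are hypothetical, the reading of `ranked_p` as the within-group share of ranked applicants is an assumption, and the rows are made up.

```python
from collections import defaultdict
from statistics import mean, median


def status_breakdown(rows):
    """Per-status counts and summary statistics, rounded to 2 decimals."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["ms_status"]].append(row)
    table = {}
    for status, members in groups.items():
        th = [m["takehome"] for m in members]
        comp = [m["composite"] for m in members]
        ranked = [1.0 if m["is_ranked"] else 0.0 for m in members]
        table[status] = {
            "n": len(members),
            "th_mean": round(mean(th), 2),
            "th_median": round(median(th), 2),
            "composite_mean": round(mean(comp), 2),
            "ranked_p": round(mean(ranked), 2),  # share with is_ranked
        }
    return table


# Made-up rows, for illustration only.
rows = [
    {"ms_status": "-> Offer", "takehome": 30.0, "composite": 2.8, "is_ranked": True},
    {"ms_status": "-> Offer", "takehome": 34.0, "composite": 3.0, "is_ranked": True},
    {"ms_status": "Reject (post TH)", "takehome": 20.0, "composite": 2.5, "is_ranked": False},
    {"ms_status": "Reject (post TH)", "takehome": 24.0, "composite": 2.9, "is_ranked": True},
]
result = status_breakdown(rows)
print(result["-> Offer"])
```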

A8 — Does our composite predict an external work-test score?

Context

Headline

Spearman correlations with takehome score

AUC for predicting above-median takehome score

Composite vs. takehome (scatter)

Status breakdown

Takeaways