A5 — Research taste validation

Context

Many of the 10.0 empirical streams opted in to a research taste test at Stage 3: a structured exercise (the 'nanobot scenario') in which applicants evaluate a hypothetical research direction. The test produces a final score, two part-scores (Part 1 and Part 2), and a tier label (Exceeds / Meets / Near / Below expectations, plus 'Cheated' for applicants flagged for misuse of AI tools). Because the test was optional per stream, only about 394 of the ~600 Stage-3 empirical applicants took it. This analysis asks two questions: does the test predict who gets ranked, and does it add signal beyond the composite score?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
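The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and it assumes the relevance multiplier is applied multiplicatively to Research Skills before the 0.50 weight, which the text implies but does not spell out. The score scale is not stated in this document, so inputs are treated as plain numbers.

```python
# Relevance multipliers as given in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier scales RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


if __name__ == "__main__":
    # Same sub-scores, different relevance: only the RS term changes.
    for level in RELEVANCE:
        print(level, round(empirical_composite(4, 4, 4, 4, 4, level), 3))
```

Because the multiplier touches only Research Skills, moving from Direct to Distant can lower the composite by at most 0.50 × 0.40 × RS.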

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
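As a sketch of how the first two flags relate, the snippet below derives them from per-stream engagement events. The input schema (a dict mapping stream name to a set of event names) and the event names themselves are hypothetical, chosen only to mirror the definitions above; the real data layout is not described in this document.

```python
# Hypothetical event names mirroring the definition of is_invited_to_worktest.
ENGAGEMENT_EVENTS = {
    "work_test_invite",
    "interview_invite",
    "ranked",
    "megastream_takehome",
}


def outcome_flags(events_by_stream):
    """Derive outcome flags for one applicant.

    events_by_stream: dict of stream name -> set of event names
    (assumed schema, for illustration only).
    """
    all_events = set().union(*events_by_stream.values())
    return {
        # Ranked by at least one stream.
        "is_ranked": "ranked" in all_events,
        # Engaged by at least one stream in any way (includes being ranked).
        "is_invited_to_worktest": bool(all_events & ENGAGEMENT_EVENTS),
    }


if __name__ == "__main__":
    applicant = {"stream_a": {"work_test_invite"}, "stream_b": {"ranked"}}
    print(outcome_flags(applicant))
```

Since "ranked" is itself one of the engagement events, every applicant with is_ranked set also has is_invited_to_worktest set, which is the superset relation stated above.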

Headline

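Research-taste final score → is_ranked: AUC = 0.634 [0.570, 0.694] (n = 393). On the same subsample, the empirical composite gets AUC = 0.624 [0.557, 0.690]. Part 1 (0.634) edges Part 2 (0.607). The 95% intervals overlap heavily, so the data do not clearly separate these predictors from one another.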

AUC for predicting is_ranked

Outcomes by tier label

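The ranked rate falls monotonically across the four quality tiers (47%, 28%, 21%, 12% from Exceeds to Below), so the tier labels order applicants consistently with stream rankings. A non-monotone order (e.g., 'Near' below 'Below') would instead have suggested an inconsistent rubric or inconsistent labelers. Applicants flagged 'Cheated' have a high mean final score (64.24) but are ranked at only 12%, which is consistent with their scores not reflecting genuine performance.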

Selection effect — who took the test?

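Test takers had a higher mean composite than Stage-3 empirical non-takers (2.68 vs. 2.25) and were ranked more often (26% vs. 11%). Because taking the test was opt-in and correlates with both the composite and the outcome, AUCs estimated on takers may be biased and may not generalize to the full applicant pool. Calibrate any conclusion against this.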

Research taste vs. composite

Predictor	n	AUC	95% CI
Research taste final	393	0.634	[0.570, 0.694]
Research taste Part 1	393	0.634	[0.575, 0.689]
Research taste Part 2	393	0.607	[0.544, 0.668]
Empirical composite (same subsample)	393	0.624	[0.557, 0.690]
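For readers who want to reproduce numbers of this kind, here is a minimal stdlib-only sketch of an AUC with a percentile-bootstrap confidence interval. The document does not say how its intervals were computed, so the bootstrap method, the resample count, and the function names are all assumptions; the toy data at the bottom is made up purely to exercise the functions.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5 (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap (1 - alpha) interval."""
    point = auc(scores, labels)  # also validates that both classes exist
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        y = [labels[i] for i in sample]
        if 0 < sum(y) < len(y):  # skip resamples with a single class
            stats.append(auc([scores[i] for i in sample], y))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


if __name__ == "__main__":
    toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]  # made-up values
    toy_labels = [0, 0, 1, 1, 1, 0]
    print(bootstrap_auc_ci(toy_scores, toy_labels, n_boot=200))
```

To compare two predictors on the same subsample, as the table does, one would resample the same applicant indices for both and look at the distribution of the AUC difference rather than at two separate intervals.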

Tier	n	Ranked	Invited	Offered/WL	Mean final score
Exceeds expectations	55	26/55 (47%)	42/55 (76%)	26/55 (47%)	64.38
Meets expectations	173	49/173 (28%)	108/173 (62%)	49/173 (28%)	53.17
Near expectations	91	19/91 (21%)	50/91 (55%)	19/91 (21%)	45.89
Below expectations	50	6/50 (12%)	21/50 (42%)	6/50 (12%)	34.62
Cheated	24	3/24 (12%)	10/24 (42%)	3/24 (12%)	64.24
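The monotonicity check on this table is simple arithmetic and can be verified directly from the counts. The sketch below uses only the n and Ranked columns shown above; the variable names are illustrative.

```python
# (n, ranked) per tier, copied from the table above.
tier_counts = {
    "Exceeds expectations": (55, 26),
    "Meets expectations": (173, 49),
    "Near expectations": (91, 19),
    "Below expectations": (50, 6),
    "Cheated": (24, 3),
}

# 'Cheated' is a flag rather than a quality level, so it is left out
# of the ordered comparison.
quality_order = [
    "Exceeds expectations",
    "Meets expectations",
    "Near expectations",
    "Below expectations",
]

rates = [tier_counts[t][1] / tier_counts[t][0] for t in quality_order]
is_monotone = all(a > b for a, b in zip(rates, rates[1:]))

total_n = sum(n for n, _ in tier_counts.values())
total_ranked = sum(r for _, r in tier_counts.values())

if __name__ == "__main__":
    for tier, rate in zip(quality_order, rates):
        print(f"{tier}: {rate:.1%}")
    print("strictly decreasing:", is_monotone)
    print(f"overall: {total_ranked}/{total_n} = {total_ranked / total_n:.1%}")
```

The totals also serve as a consistency check: the tier counts sum to the n = 393 used in the AUC table, and the overall ranked rate matches the 26% reported for test takers.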

Group	n	Mean composite	Median composite	P(ranked)
Took research taste test	393	2.68	2.71	26%
Did NOT take research taste test (Stage 3 empirical)	398	2.25	2.39	11%

Spearman ρ between research-taste final and empirical composite = 0.030, i.e. essentially zero. The two scores therefore carry nearly orthogonal signal, so combining them should improve over either alone (A2 also looked at this).
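A minimal stdlib-only sketch of the Spearman computation follows, together with one simple way to combine two weakly correlated scores: averaging their ranks. The rank-average combination is an illustrative choice only; this document does not say how A2 or the pipeline combines predictors, and all function names here are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_average(x, y):
    """Combine two scores on different scales by averaging their ranks."""
    return [(a + b) / 2 for a, b in zip(average_ranks(x), average_ranks(y))]


if __name__ == "__main__":
    taste = [64.0, 53.0, 46.0, 35.0]     # made-up values
    composite = [2.4, 2.9, 2.2, 2.7]     # made-up values
    print(spearman(taste, composite))
    print(rank_average(taste, composite))
```

Working on ranks sidesteps the scale mismatch between the two scores (roughly 0 to 100 for the final score versus low single digits for the composite, judging by the tables). Whether the combined score actually beats either input would need to be checked with an AUC on the same subsample.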
