Many of the 10.0 empirical streams opted in to a research taste test at Stage 3: a structured exercise (the 'nanobot scenario') in which applicants evaluate a hypothetical research direction. The test produces a final score, two part-scores (Part 1 and Part 2), and a tier label (Exceeds / Meets / Near / Below expectations, plus 'Cheated' for applicants flagged for misuse of AI tools). Of the 791 Stage-3 empirical applicants, 393 took it (the test was optional per stream). Does it predict who gets ranked, and does it add signal beyond the composite score?
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
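The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and it assumes the relevance multiplier scales RS before the 0.50 weight is applied, which the text implies but does not spell out.

```python
# Relevance multiplier applied to Research Skills (RS), per the text above.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def te_score(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs: float, mle: float, swe: float, math: float,
                        ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the multiplier is applied to RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE[relevance]
    return 0.50 * adjusted_rs + 0.35 * te_score(mle, swe, math) + 0.15 * ss


if __name__ == "__main__":
    # Same sub-scores, different relevance labels.
    for label in RELEVANCE:
        print(label, round(empirical_composite(3, 3, 3, 3, 3, label), 3))
```

With identical sub-scores, only the RS term changes across relevance labels, so the multiplier moves the composite by at most half of its nominal size.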
Outcome definitions used throughout these analyses:
- is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." It is not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
- is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
- passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

Research-taste final score → is_ranked: AUC = 0.634 [0.570, 0.694] (n = 393). On the same subsample, the empirical composite gets AUC = 0.624 [0.557, 0.690]. Part 1 (0.634) edges Part 2 (0.607).

AUC by predictor (outcome: is_ranked):

| Predictor | n | AUC | 95% CI |
|---|---|---|---|
| Research taste final | 393 | 0.634 | [0.570, 0.694] |
| Research taste Part 1 | 393 | 0.634 | [0.575, 0.689] |
| Research taste Part 2 | 393 | 0.607 | [0.544, 0.668] |
| Empirical composite (same subsample) | 393 | 0.624 | [0.557, 0.690] |

Tier-level summary:

| Tier | n | Ranked | Invited | Offered/WL | Mean final score |
|---|---|---|---|---|---|
| Exceeds expectations | 55 | 26/55 (47%) | 42/55 (76%) | 26/55 (47%) | 64.38 |
| Meets expectations | 173 | 49/173 (28%) | 108/173 (62%) | 49/173 (28%) | 53.17 |
| Near expectations | 91 | 19/91 (21%) | 50/91 (55%) | 19/91 (21%) | 45.89 |
| Below expectations | 50 | 6/50 (12%) | 21/50 (42%) | 6/50 (12%) | 34.62 |
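| Cheated | 24 | 3/24 (12%) | 10/24 (42%) | 3/24 (12%) | 64.24 |

The ranked rate falls monotonically across the four graded tiers (47%, 28%, 21%, 12%), which suggests the rubric and labelers are consistent; a non-monotone order (e.g., 'Near' below 'Below') would have pointed the other way. The 'Cheated' tier is where the label carries signal beyond the underlying final score: its mean final score (64.24) is close to that of 'Exceeds' (64.38), yet its ranked rate (12%) matches 'Below'.

Selection-effect comparison (took vs. did not take the test):

| Group | n | Mean composite | Median composite | P(ranked) |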
|---|---|---|---|---|
| Took research taste test | 393 | 2.68 | 2.71 | 26% |
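| Did NOT take research taste test (Stage 3 empirical) | 398 | 2.25 | 2.39 | 11% |

Takers have a noticeably higher composite than non-takers (mean 2.68 vs. 2.25, median 2.71 vs. 2.39) and are ranked more than twice as often (26% vs. 11%), so the opt-in subsample is not representative of the Stage-3 empirical pool. Opt-in selection of this kind can inflate the apparent AUC (selecting on a correlate of the dependent variable); calibrate any conclusion against this.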
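Spearman ρ between research-taste final and empirical composite = 0.030, which is close to zero: the two scores rank applicants almost independently. Since each predicts is_ranked at a similar level on this subsample (AUC 0.634 and 0.624), the two carry largely orthogonal signal, and combining them should improve over either alone (A2 also looked at this).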
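For readers who want to reproduce the rank correlation, here is a minimal pure-Python sketch of Spearman ρ (the Pearson correlation of average ranks, with ties sharing a rank). The function names are illustrative and this is not necessarily how the reported 0.030 was computed; in practice `scipy.stats.spearmanr` does the same job.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

The inputs would be the research-taste final score and the empirical composite for the same applicants, after dropping rows where either is missing.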
Sample. Stage-3 empirical pool (n=791). Of these, 393 took the research-taste test and 398 did not. Primary AUC analysis is over test-takers only. Selection-effect comparison reports composite distribution by took-vs-didn't.
Outcome variable(s). is_ranked (primary). Tier-level summary also reports invited/offered rates.
Predictor fields. Final score (the "final score" field from "[10.0] Research taste work test results", converted list→scalar), plus Part 1 and Part 2 scores. Tier label takes one of [Exceeds | Meets | Near | Below expectations | Cheated]; per project memory, cheaters are retained in the Stage 3 analysis.
Filters applied. Stage-3 empirical filter applied. Test-takers determined by non-null Final score. Listwise drop for AUC.
Missing-data handling. Listwise drop on each predictor individually for its AUC row.
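The per-predictor listwise drop and the AUC rows can be sketched as follows. This is a hedged illustration only: the source does not state how the 95% CIs were obtained, so the percentile bootstrap here is an assumption, and the row layout (a list of dicts keyed by field name) and function names are hypothetical.

```python
import random


def auc(scores, labels):
    """P(random positive outscores random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_with_ci(rows, predictor, outcome="is_ranked", n_boot=1000, seed=0):
    """AUC for one predictor with a percentile-bootstrap 95% CI.

    Listwise drop is applied for this predictor only, so n can differ
    between rows of the AUC table.
    """
    kept = [(r[predictor], r[outcome]) for r in rows
            if r.get(predictor) is not None and r.get(outcome) is not None]
    point = auc([s for s, _ in kept], [y for _, y in kept])

    rng = random.Random(seed)
    boots = []
    while len(boots) < n_boot:
        sample = [kept[rng.randrange(len(kept))] for _ in kept]
        ys = [y for _, y in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples with a single class
            boots.append(auc([s for s, _ in sample], ys))
    boots.sort()
    lo = boots[int(0.025 * n_boot)]
    hi = boots[min(n_boot - 1, int(0.975 * n_boot))]
    return {"n": len(kept), "auc": point, "ci": (lo, hi)}
```

The pairwise AUC is quadratic in sample size, which is acceptable at n ≈ 400; for larger pools a rank-based implementation such as `sklearn.metrics.roc_auc_score` would be the natural substitute.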
Key assumptions / caveats.