A5 — Does the research-taste work test predict rankings?

Context

Many of the 10.0 empirical streams opted in to a research taste test at Stage 3: a structured exercise (the 'nanobot scenario') in which applicants evaluate a hypothetical research direction. The test outputs a final score, two part-scores (Part 1 and Part 2), and a tier label (Exceeds / Meets / Near / Below expectations, plus 'Cheated' for applicants flagged for misuse of AI tools). Of the 791 applicants in the Stage-3 empirical pool, 393 took it (the test was optional per stream). Does it predict who gets ranked, and does it add signal beyond the composite score?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
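The formula above can be sketched in a few lines. This is an illustration only: the function and argument names are hypothetical, and the sketch assumes the relevance multiplier scales Research Skills before the weighted sum, which is how the text reads but is not confirmed by the pipeline code.

```python
# Weights and multipliers are taken from the text; all names are hypothetical.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier is applied to RS before weighting.
    """
    rs_adjusted = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math_score)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# An applicant scoring 3 on every rubric, with Direct vs. Adjacent relevance.
direct = empirical_composite(3, 3, 3, 3, 3, relevance="Direct")
adjacent = empirical_composite(3, 3, 3, 3, 3, relevance="Adjacent")
```

Under these assumptions, dropping from Direct to Adjacent relevance lowers only the Research Skills term, so the composite falls by 0.50 × 0.15 × RS.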

Outcome definitions

Outcome definitions used throughout these analyses:

Headline

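Research-taste final score → is_ranked: AUC = 0.634 [0.570, 0.694] (n = 393). On the same subsample, the empirical composite reaches AUC = 0.624 [0.557, 0.690]. Part 1 (0.634) is nominally ahead of Part 2 (0.607), but the 95% intervals overlap almost entirely, so the gap is not distinguishable from noise.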

AUC for predicting is_ranked

| Predictor | n | AUC | 95% CI |
|---|---|---|---|
| Research taste final | 393 | 0.634 | [0.570, 0.694] |
| Research taste Part 1 | 393 | 0.634 | [0.575, 0.689] |
| Research taste Part 2 | 393 | 0.607 | [0.544, 0.668] |
| Empirical composite (same subsample) | 393 | 0.624 | [0.557, 0.690] |
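For readers who want to reproduce numbers of this kind, here is a minimal sketch of an AUC with a percentile-bootstrap interval. It is an assumption that the intervals in the table were produced this way; the source does not say how they were computed, and the function names are hypothetical.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (resampling applicants)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy data, not the real applicant scores.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_lo, toy_hi = bootstrap_auc_ci(toy_scores, toy_labels, n_boot=200)
```

The pairwise loop is quadratic in sample size; it is fine for a few hundred applicants but a rank-based implementation would be the better choice at scale.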

Outcomes by tier label

| Tier | n | Ranked | Invited | Offered/WL | Mean final score |
|---|---|---|---|---|---|
| Exceeds expectations | 55 | 26/55 (47%) | 42/55 (76%) | 26/55 (47%) | 64.38 |
| Meets expectations | 173 | 49/173 (28%) | 108/173 (62%) | 49/173 (28%) | 53.17 |
| Near expectations | 91 | 19/91 (21%) | 50/91 (55%) | 19/91 (21%) | 45.89 |
| Below expectations | 50 | 6/50 (12%) | 21/50 (42%) | 6/50 (12%) | 34.62 |
| Cheated | 24 | 3/24 (12%) | 10/24 (42%) | 3/24 (12%) | 64.24 |

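The ranked rate falls monotonically across the four performance tiers (47% → 28% → 21% → 12%), so the labels are ordered consistently with stream rankings. Because mean final score falls across the same tiers (64.38 → 34.62), this alone does not show that the labels carry signal beyond the underlying final score. A non-monotone order (e.g., 'Near' below 'Below') would have suggested an inconsistent rubric or inconsistent labelers. 'Cheated' is a flag rather than a performance tier and sits outside this ordering.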
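The monotonicity check can be run directly from the counts in the tier table. The sketch below uses only those published counts; the variable names are hypothetical.

```python
# (tier, ranked, n) for the four performance tiers, copied from the table.
# 'Cheated' is left out because it is a flag, not a performance level.
tiers = [
    ("Exceeds", 26, 55),
    ("Meets", 49, 173),
    ("Near", 19, 91),
    ("Below", 6, 50),
]

rates = [ranked / n for _, ranked, n in tiers]

# Monotone if each tier's ranked rate is at least that of the next tier down.
is_monotone = all(a >= b for a, b in zip(rates, rates[1:]))

# Ratio behind the "~4x" comparison between the top and bottom tiers.
exceeds_vs_below = rates[0] / rates[-1]
```

With these counts the rates come out in strictly decreasing order, and the Exceeds-to-Below ratio is a little under 4.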

Selection effect — who took the test?

| Group | n | Mean composite | Median composite | P(ranked) |
|---|---|---|---|---|
| Took research taste test | 393 | 2.68 | 2.71 | 26% |
| Did NOT take research taste test (Stage 3 empirical) | 398 | 2.25 | 2.39 | 11% |

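Takers have a noticeably higher composite than non-takers (mean 2.68 vs. 2.25) and are ranked more than twice as often (26% vs. 11%). The AUCs reported for test-takers are therefore estimated on a self-selected, stronger subsample. Opt-in selection of this kind can distort the apparent AUC relative to the full Stage-3 pool (the concern here is inflation, since takers were selected on a correlate of the outcome). Calibrate any conclusion against this.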

Research taste vs. composite

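Spearman ρ between research-taste final and empirical composite = 0.030, i.e. essentially zero among test-takers. The two scores therefore rank applicants almost independently, and since each predicts is_ranked on its own, combining them should improve over either alone (A2 also looked at this).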
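As a reference for how a rank correlation like this is obtained, here is a minimal dependency-free sketch: Spearman ρ is the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names are hypothetical, and this is not necessarily the routine used to produce the figure above.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean position of the tie group, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A value near zero, as reported here, means that knowing an applicant's composite rank says almost nothing about their research-taste rank within the test-taker subsample.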

Takeaways

  1. The research-taste test predicts ranking about as well as the composite does on the same subsample (AUC ≈ 0.63 either way). Taken alone it is no better than the composite, but the two are nearly uncorrelated (Spearman ρ = 0.030), so it is not redundant.
  2. Tier labels work: the ranked rate falls with each tier, and applicants graded 'Exceeds expectations' have ~4× the ranking rate of 'Below expectations' (47% vs. 12%). The labels are usable as a categorical summary.
  3. 'Cheated' applicants scored high (mean final score 64.24, on par with 'Exceeds expectations') but were rarely ranked (3/24, 12%). This is consistent with streams looking past the score when there was evidence of AI misuse, though n = 24 is small. The flag captured information that the score alone did not.
  4. The opt-in selection effect is real: takers had higher composites than non-takers (mean 2.68 vs. 2.25), so the AUC measured on takers may overstate the population effect. For 11.0, mandating the test (or selecting takers more deliberately) would remove this ambiguity.

Debug: how the data was interpreted

Sample. Stage-3 empirical pool (n=791). Of these, 393 took the research-taste test and 398 did not. Primary AUC analysis is over test-takers only. Selection-effect comparison reports composite distribution by took-vs-didn't.

Outcome variable(s). is_ranked (primary). Tier-level summary also reports invited/offered rates.

Predictor fields. Final score from [10.0] Research taste work test results (stored as a list, converted to a scalar), plus Part 1 and Part 2 scores. Tier label is one of [Exceeds | Meets | Near | Below expectations | Cheated]; applicants labeled Cheated are retained in the Stage-3 analysis.

Filters applied. Stage-3 empirical filter applied. Test-takers determined by non-null Final score. Listwise drop for AUC.

Missing-data handling. Listwise drop on each predictor individually for its AUC row.
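Dropping rows per predictor, rather than once across all predictors, can be sketched as follows. The field names (`final`, `part1`, `is_ranked`) and the helper are hypothetical and stand in for whatever the underlying table uses.

```python
def complete_pairs(rows, predictor, outcome="is_ranked"):
    """Keep only rows where both the predictor and the outcome are present."""
    return [
        (row[predictor], row[outcome])
        for row in rows
        if row.get(predictor) is not None and row.get(outcome) is not None
    ]


# Toy rows, not real applicant data.
rows = [
    {"final": 50, "part1": 3, "is_ranked": 1},
    {"final": None, "part1": 2, "is_ranked": 0},  # dropped for 'final' only
    {"final": 61, "part1": None, "is_ranked": 0},  # dropped for 'part1' only
]

final_pairs = complete_pairs(rows, "final")
part1_pairs = complete_pairs(rows, "part1")
```

Because each predictor is filtered on its own, the n in each AUC row can differ in principle; in this analysis all four rows happen to share n = 393.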

Key assumptions / caveats.