D4 — Work test score distributions & sub-score relationships

Context

10.0 used two work tests at Stage 3: a research-taste test for the empirical track (the nanobot-scenario test, ~394 takers) and a policy/governance writing test (~82 takers). We've already looked at predictive validity (A5 / B5). This analysis is more granular: how do the score distributions actually look, are the sub-scores (Part 1 / Part 2) redundant, and is the overall policy score doing more than the grading-tier label?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
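The formula above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function and argument names are hypothetical, the example scores are made up, and it assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied.

```python
# Weights and multipliers are taken from the text above.
# Function names, argument names, and the example scores are hypothetical.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier is applied to RS before weighting.
    """
    adjusted_rs = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Illustrative applicant with made-up sub-scores.
example = empirical_composite(
    rs=80, mle=70, swe=60, math_score=50, ss=90, relevance="Adjacent"
)
print(round(example, 2))
```

Because the multiplier only touches Research Skills, moving from Direct to Distant can remove at most 0.50 × 0.40 = 20% of the RS value from the composite.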

Outcome definitions

Outcome definitions used throughout these analyses:

Research-taste final score: ranked vs not

Ranked    n   Mean  Median   P25   P75
No      291   50.4    50.4  44.7  55.2
Yes     103   54.2    54.1  49.9  60.1

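Ranked applicants score about 3.8 points higher on average (54.2 vs 50.4), but the distributions overlap substantially: the ranked P25 (49.9) sits just below the not-ranked median (50.4).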
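A summary table of this shape can be produced with the standard library alone. The sketch below uses toy data and hypothetical function names; the percentile interpolation method (`inclusive`) is an assumption, since the method behind the reported P25/P75 isn't stated.

```python
from statistics import mean, median, quantiles


def summarize(scores):
    """Return n, mean, median, P25 and P75 for one group of scores."""
    # Assumption: "inclusive" interpolation; the original method is not stated.
    p25, _, p75 = quantiles(scores, n=4, method="inclusive")
    return {
        "n": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "p25": p25,
        "p75": p75,
    }


def summarize_by_ranked(rows):
    """Group (ranked_label, score) pairs by label and summarize each group."""
    groups = {}
    for ranked, score in rows:
        groups.setdefault(ranked, []).append(score)
    return {label: summarize(values) for label, values in groups.items()}


# Toy data, not the real work-test scores.
toy_rows = [
    ("No", 40.0), ("No", 50.0), ("No", 60.0),
    ("Yes", 50.0), ("Yes", 55.0), ("Yes", 60.0), ("Yes", 65.0),
]
toy_summary = summarize_by_ranked(toy_rows)
for label, stats in toy_summary.items():
    print(label, stats)
```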

Research-taste Part 1 vs Part 2

Spearman ρ between Part 1 and Part 2 scores = +0.64 (n = 394).

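If ρ is high (>0.7), the two parts measure essentially the same thing and one of them could be dropped. If it is low-to-moderate (<0.5), they capture different aspects of research taste and both are worth keeping. The observed +0.64 falls between these thresholds: the parts overlap substantially but are not interchangeable.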

Each part's correlation with the empirical composite: Part 1 = +0.02, Part 2 = +0.05. Both are near zero.
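For readers who want to reproduce correlations like these, Spearman ρ is the Pearson correlation of the two rank vectors. The sketch below is a dependency-free illustration with toy scores; the function names are hypothetical, and giving tied values their average rank is an assumption about tie handling, not a statement of how the figures above were computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores for five hypothetical takers, not real data.
part1 = [3, 1, 4, 2, 5]
part2 = [2, 1, 5, 3, 4]
rho = spearman_rho(part1, part2)
print(round(rho, 2))
```

Because only ranks matter, any monotone rescaling of either part's scores leaves ρ unchanged, which suits rubric scores whose spacing is not obviously interval-scaled.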

Policy/gov work test: overall score by ranked status

Ranked   n  Mean overall  Median overall
No      66           3.2             3.5
Yes     16           3.2             3.5

The grading field for the policy/gov work test isn't a clean categorical tier; it's a generated rationale dict per response. So we look at the overall numeric score by ranked status instead. Spearman ρ between the policy/gov work-test score and the empirical composite is -0.09 (descriptive only, since most takers are policy/gov track).

Takeaways

  1. Research-taste Part 1 and Part 2 are partially redundant (ρ ≈ +0.64). Not enough to drop one outright, but suggests the rubric could be tightened — there's overlap between what the two parts grade.
  2. Both research-taste parts are nearly uncorrelated with the empirical composite (+0.02 and +0.05). Whatever the test captures is largely independent of the composite, even if neither part is strongly predictive on its own.
  3. Policy/gov work test score does not differ by ranked status in this sample: mean 3.2 and median 3.5 in both groups. With only 16 ranked takers (n=82 overall), this is a weak basis for any conclusion.
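  4. For 11.0: consider whether the research-taste test could be one part instead of two. At ρ ≈ +0.64 the marginal informational value of Part 2 over Part 1 may be limited, so a single part could free up applicant time without losing much signal. Per takeaway 1, this is worth checking before dropping a part outright.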

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Research-taste takers: n=394. Policy/gov work test takers: n=82.

Outcome variable(s). Score distributions (descriptive) + correlation with composite.

Predictor fields. Work-test scores.

Filters applied. Canonical dedup. Per-test sample = takers only.

Missing-data handling. Listwise drop on score columns.
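As a rough illustration of listwise deletion: any row missing a value in one of the score columns is dropped before analysis. The column names and rows below are hypothetical, and treating `None` as the missing marker is an assumption.

```python
def listwise_drop(rows, score_columns):
    """Keep only rows that have a non-missing value in every score column."""
    return [
        row for row in rows
        if all(row.get(col) is not None for col in score_columns)
    ]


# Hypothetical rows and column names, not the real dataset.
toy = [
    {"id": 1, "part1": 52.0, "part2": 48.5},
    {"id": 2, "part1": None, "part2": 61.0},  # dropped: part1 missing
    {"id": 3, "part1": 47.0},                 # dropped: part2 absent
]
kept = listwise_drop(toy, ["part1", "part2"])
print([row["id"] for row in kept])
```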

Key assumptions / caveats.