D3 — Do referee types predict mentor evaluations?

Context

B4 (Part B) showed that AI-safety-org references carry the strongest signal for predicting whether an applicant gets ranked at Stage 3. But selection and performance are different — does the same reference category also predict in-program mentor evaluations?

This analysis runs on 9.0 only — it's the only cohort with both referee categorization AND mentor evaluations. 10.0 has references but no mentor evals yet; 7.0/8.0 have mentor evals but no referee categorization.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
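
As a rough sketch of how this formula combines the sub-scores, the snippet below uses the weights stated above. It assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied, which is one reading of the text. The function names, argument names, and the example applicant are hypothetical and not taken from the actual pipeline.

```python
# Relevance multipliers as listed in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs: float, mle: float, swe: float, math: float,
                        ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS for the empirical track."""
    # Assumption: the multiplier is applied to RS before weighting.
    adjusted_rs = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant; the score scale is an assumption.
example = empirical_composite(rs=8.0, mle=7.0, swe=6.0, math=5.0, ss=9.0,
                              relevance="Adjacent")
```

Under this reading, moving the same applicant from Direct to Adjacent lowers only the Research Skills term; Technical Execution and Soft Skills are unaffected.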

Outcome definitions

Outcome definitions used throughout these analyses:

Why this matters

If AI-safety-org references predict selection but not performance, that would be analogous to the CodeSignal paradox: the signal helps us pick people but doesn't reflect what mentors value. If it predicts both, the signal is solid and worth surfacing more prominently to reviewers.

Per-category effect on mentor composite (9.0)

Category                                         n with  n without  Mean composite (with)  Mean composite (without)   Diff
Academia – other STEM                                 9         77                   7.90                      7.38  +0.52
AI safety org                                        32         54                   7.65                      7.31  +0.33
Academia – social science / humanities / policy      10         76                   7.44                      7.44  -0.00
Government / policy org                               5         81                   7.30                      7.45  -0.15
Other industry                                        5         81                   7.15                      7.46  -0.31
AI/ML industry                                       21         65                   7.19                      7.52  -0.33
Academia – AI/ML                                     40         46                   7.18                      7.67  -0.49
Unknown                                               4         82                    n/a                       n/a  (too small)
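
For readers who want to reproduce this kind of with/without comparison, here is a minimal sketch using only the standard library. The record layout (`composite`, `ref_categories`) is hypothetical, and the minimum cell size of 5 mirrors the n≥5 filter listed in the debug notes. It is not the code that produced the table.

```python
from statistics import mean

MIN_CELL = 5  # cells smaller than this are reported without a difference


def category_effect(fellows, category):
    """Compare mean composite for fellows with vs. without a referee category.

    `fellows` is a list of dicts with hypothetical keys:
      'composite'      -> float, the mentor-eval composite
      'ref_categories' -> set of referee category names for that fellow
    """
    with_cat = [f["composite"] for f in fellows if category in f["ref_categories"]]
    without = [f["composite"] for f in fellows if category not in f["ref_categories"]]
    result = {"n_with": len(with_cat), "n_without": len(without), "diff": None}
    if len(with_cat) >= MIN_CELL and len(without) >= MIN_CELL:
        result["mean_with"] = mean(with_cat)
        result["mean_without"] = mean(without)
        result["diff"] = result["mean_with"] - result["mean_without"]
    return result


# Toy data only, to show the call shape; these are not real fellows.
toy = (
    [{"composite": 8.0, "ref_categories": {"AI safety org"}} for _ in range(6)]
    + [{"composite": 7.0, "ref_categories": {"AI/ML industry"}} for _ in range(6)]
)
effect = category_effect(toy, "AI safety org")
```

A raw difference in means like this carries no uncertainty estimate, so with cells as small as 5 to 10 fellows it is best read descriptively.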

Does the AI-safety-org-ref flag add value beyond centralized review scores?

In a joint regression on n=59 9.0 fellows:

  - R² using centralized review scores alone: 0.183
  - R² adding has_AI_safety_org_ref flag: 0.183
  - Incremental R² from the ref flag: +0.000

The increment here is +0.000, which suggests the AI-safety-org-ref signal is already absorbed by the centralized review scores (which include "AI safety motivation"). A meaningfully positive increment would instead have indicated that the ref-category signal carries independent information about who will perform.
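
The incremental-R² comparison can be sketched as two nested least-squares fits. The source does not specify the regression method, so ordinary least squares with an intercept is an assumption here, and all function names and the toy data are hypothetical. The sketch uses only the standard library and does not handle collinear predictors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(rows, y):
    """In-sample R^2 of an OLS fit of y on the predictors in `rows` plus an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def incremental_r2(base_rows, flag, y):
    """Return (base R^2, full R^2, increment) when a binary flag is added."""
    base = r_squared(base_rows, y)
    full = r_squared([list(r) + [f] for r, f in zip(base_rows, flag)], y)
    return base, full, full - base


# Toy data only: one review score, one binary flag, one outcome.
review = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
flag = [0, 1, 0, 1, 0, 1]
outcome = [1.0, 2.5, 3.0, 4.5, 5.0, 6.5]
base, full, gain = incremental_r2(review, flag, outcome)
```

Because the two models are nested, in-sample R² cannot decrease when the flag is added, so an increment that rounds to +0.000 means the flag explains essentially no additional variance in this sample.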

Takeaways

  1. AI-safety-org references show a positive lift in mentor evals in the per-category comparison: fellows with this kind of referee score modestly higher on average (+0.33 on the mentor composite, n=32 with vs. 54 without). But the differences are small and the n is limited.
  2. The marginal value over centralized review scores is essentially zero (incremental R² of +0.000 on n=59): the ref-category signal is largely absorbed by what the centralized reviewers already capture.
  3. Caveat heavy: 9.0 only, and the joined sample is small (n=86, dropping to n=59 in the joint regression). Treat this as weak supporting evidence for B4, not standalone proof.
  4. For 11.0: continue surfacing the ref-category flag to Stage-3 reviewers (per B4), but don't over-weight it as a performance predictor; it's mostly carrying information the review process already extracts.
Debug: how the data was interpreted (safe to skip)

Sample. 9.0 completed applicants joined to mentor evals via person_id. n_joined=86.

Outcome variable(s). Mean of 4 standardized mentor-eval dimensions.

Predictor fields. Categorical referee types from [9.0] Referee type (from [Ref] References (linked to 9.0 reference)). Each category is turned into a binary flag: does either of the applicant's 2 references fall in that category?
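
A minimal sketch of how such per-category flags could be derived from an applicant's two referee types. The category names come from the results table. The function name and input shape are hypothetical, and the actual field extraction may differ.

```python
CATEGORIES = (
    "Academia – other STEM",
    "AI safety org",
    "Academia – social science / humanities / policy",
    "Government / policy org",
    "Other industry",
    "AI/ML industry",
    "Academia – AI/ML",
    "Unknown",
)


def referee_flags(ref_types):
    """Map an applicant's referee types (up to 2) to one 0/1 flag per category."""
    present = set(ref_types)
    return {cat: int(cat in present) for cat in CATEGORIES}


# Hypothetical applicant with one AI-safety-org and one academic AI/ML referee.
flags = referee_flags(["AI safety org", "Academia – AI/ML"])
```

Since a fellow with two referees from different categories is flagged in both, the "n with" counts in the table overlap and sum to more than the 86 joined fellows.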

Filters applied. Completed apps; inner join with mentor evals; n≥5 per cell.

Missing-data handling. Per-category listwise drop (handled by per-cell ≥5 threshold).

Key assumptions / caveats.