B4 — Do references (and their type) predict ranking?

Context

Each 10.0 applicant could list up to two references. Each reference was categorized (using an AI labeler) into one of eight types — AI safety org, AI/ML industry, Academia – AI/ML, Academia – other STEM, Academia – social science / humanities / policy, Government / policy org, Other industry, or Unknown. Does having references predict ranking? And do specific kinds of references — e.g., from AI safety orgs — help more?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
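The composite formula above can be written out as a short sketch. The weights and relevance multipliers come from the text; the function names, the score scale in the example, and the choice to apply the multiplier to RS before weighting are assumptions made for illustration, not the pipeline's actual code.

```python
# Relevance multipliers from the text; applied to Research Skills (RS) only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs: float, mle: float, swe: float, math: float,
                        ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier scales RS before the 0.50 weight.
    """
    rs_adjusted = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math)
    return 0.50 * rs_adjusted + 0.35 * te + 0.15 * ss


# Hypothetical applicant on an assumed 0-10 scale.
score = empirical_composite(rs=8, mle=7, swe=6, math=5, ss=9, relevance="Adjacent")
print(round(score, 3))
```

With these hypothetical inputs, moving from Direct to Adjacent relevance lowers only the RS term, so the composite drops by 0.50 × 8 × 0.15 = 0.6 points.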

Outcome definitions

Outcome definitions used throughout these analyses:

How references are used in selection

References are collected at Stage 2 (confirmed for 11.0 as well). Reviewers see reference content during Stage 3. The categorization itself is a Sanyu-side AI labeling step — it doesn't directly feed into a Stage-1/2 score, but is available for stream reviewers and for analyses like this.

Q1 — Number of references

| # refs categorized | n | n ranked | P(ranked) |
|---|---|---|---|
| 0 | 412 | 3 | 0.7% |
| 1 | 410 | 27 | 6.6% |
| 2 | 1,381 | 159 | 11.5% |

In the full pool, applicants with 2 categorized references rank at a higher rate than those with 0–1. But this is heavily confounded by who submits references at all — applicants who reach Stage 2 and finish a full application reliably submit references; applicants filtered earlier often don't.
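As a quick check, the raw rates in the Q1 table can be recomputed from its counts. This is plain arithmetic on the published numbers; the variable and function names are illustrative, and nothing here adjusts for the confounding described above.

```python
# (n applicants, n ranked) per number of categorized references, from the Q1 table.
COUNTS = {0: (412, 3), 1: (410, 27), 2: (1381, 159)}


def ranking_rates(counts):
    """P(ranked) within each reference-count group."""
    return {k: ranked / n for k, (n, ranked) in counts.items()}


def pooled_rate(counts):
    """Overall P(ranked) across all groups."""
    total_n = sum(n for n, _ in counts.values())
    total_ranked = sum(r for _, r in counts.values())
    return total_ranked / total_n


rates = ranking_rates(COUNTS)
for k in sorted(rates):
    print(f"{k} refs: {rates[k]:.1%}")        # 0.7%, 6.6%, 11.5%
print(f"pooled: {pooled_rate(COUNTS):.1%}")   # 8.6% (189 of 2,203)
```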

Q2 — Type of reference (full pool)

| Category | n with | n ranked | P(ranked \| has) | P(ranked \| hasn't) | Lift (pp) |
|---|---|---|---|---|---|
| AI safety org | 250 | 49 | 19.6% | 7.2% | +12.4 |
| Academia – AI/ML | 772 | 87 | 11.3% | 7.1% | +4.1 |
| Government / policy org | 147 | 18 | 12.2% | 8.3% | +3.9 |
| Unknown | 76 | 9 | 11.8% | 8.5% | +3.4 |
| Academia – social science / humanities / policy | 223 | 23 | 10.3% | 8.4% | +1.9 |
| AI/ML industry | 324 | 33 | 10.2% | 8.3% | +1.9 |
| Academia – other STEM | 444 | 43 | 9.7% | 8.3% | +1.4 |
| Other industry | 296 | 13 | 4.4% | 9.2% | -4.8 |

Lift = the difference, in percentage points, between the ranking rate of applicants who have this kind of referee and the rate of applicants who don't. AI-safety-org references show by far the largest positive lift in the raw data (+12.4), followed by Academia – AI/ML (+4.1) and Government / policy org (+3.9). Other industry is the only category with negative lift (-4.8).
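The lift column can be reproduced from the table's counts. The sketch below assumes the comparison group is everyone else in the full pool (2,203 applicants, 189 ranked, obtained by summing the Q1 table); that assumption matches the published P(ranked | hasn't) values, but the function name and layout are illustrative only.

```python
TOTAL_N = 2203       # full pool, sum of the Q1 "n" column
TOTAL_RANKED = 189   # sum of the Q1 "n ranked" column


def lift_pp(n_with, ranked_with, total_n=TOTAL_N, total_ranked=TOTAL_RANKED):
    """Lift in percentage points: P(ranked | has) - P(ranked | hasn't)."""
    p_has = ranked_with / n_with
    p_hasnt = (total_ranked - ranked_with) / (total_n - n_with)
    return 100.0 * (p_has - p_hasnt)


print(f"AI safety org:           {lift_pp(250, 49):+.1f} pp")   # +12.4
print(f"Government / policy org: {lift_pp(147, 18):+.1f} pp")   # +3.9
print(f"Other industry:          {lift_pp(296, 13):+.1f} pp")   # -4.8
```

Because an applicant can hold references in two categories, the "has" groups overlap, so these lifts are not independent of one another.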

Q3 — Reference signal beyond other features (Stage-3 empirical)

In a logistic regression restricted to Stage-3 empirical applicants, the strongest individual reference predictor is has_AI safety org (standardized coef = +0.26). Univariate AUC of has_AI safety org as a predictor of is_ranked on this subsample = 0.568 [0.533, 0.607]. Full-model AUC (all category flags + n_refs) = 0.629 [0.578, 0.674].
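For intuition on what a univariate AUC means for a binary flag such as has_AI safety org, here is a minimal pure-Python sketch. It uses the pairwise (Mann-Whitney) definition with ties counted as one half; the function name and the toy data are assumptions for illustration and are not the Stage-3 sample or the analysis code.

```python
def auc_binary_flag(flags, labels):
    """AUC of a 0/1 predictor against a 0/1 outcome.

    Equals the probability that a randomly chosen ranked applicant has a
    higher flag value than a randomly chosen unranked one, with ties = 0.5.
    """
    pos = [f for f, y in zip(flags, labels) if y == 1]
    neg = [f for f, y in zip(flags, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one ranked and one unranked applicant")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data only: flag = has the reference type, label = is_ranked.
toy_flags = [1, 1, 0, 0, 1, 0]
toy_labels = [1, 0, 1, 0, 1, 0]
print(round(auc_binary_flag(toy_flags, toy_labels), 4))
```

For a binary flag this reduces to 0.5 + 0.5 × (TPR − FPR), which is why a flag held by a small share of applicants produces an AUC close to 0.5 even when the rate difference within that group is large.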

Takeaways

  1. AI-safety-org references carry the strongest signal. In the full pool, applicants with at least one referee in this bucket are ranked 19.6% of the time versus 7.2% for those without; the +12.4 pp lift is the largest of any category.
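  2. Other industry is the only category that skews negative in the full pool (4.4% vs 9.2%, -4.8 pp), which may signal a less-AIS-focused profile. Government / policy org and AI/ML industry references show small positive raw lifts (+3.9 pp and +1.9 pp). These raw figures are not track-adjusted: government / policy references cluster with applicants in tracks (P&S/TG) that have lower overall ranking rates.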
  3. The signal isn't enormous beyond what other features already capture. Stage-3 logistic AUC from references alone is ~0.63 — useful but secondary to attributes/composite.
  4. For 11.0: Keep the categorization. The 'AI safety org' flag is the cheapest possible "is this person in the field?" indicator and it carries weight. Worth surfacing more prominently to Stage-3 reviewers.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Full pool (n=2,203) and Stage-3 empirical (n=791) for the regression.

Outcome variable(s). is_ranked.

Predictor fields. [10.0] Referee type (from [ref 1] Reference link) and [10.0] Referee type (from [ref 2] Reference link) — JSON-list cells flattened to scalar category. Per-category binary flags + total ref count.

Filters applied. Canonical dedup.

Missing-data handling. None category treated as 'no reference'.
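The predictor construction and missing-data rule described above can be sketched as follows. The exact cell format is an assumption (a JSON list string such as '["AI safety org"]', possibly empty or missing), as are the function names; the feature names has_<category> and n_refs follow the ones used in the regression section. Counting 'Unknown' as a categorized reference is also an assumption, based on it appearing as a category in the Q2 table.

```python
import json

CATEGORIES = [
    "AI safety org",
    "AI/ML industry",
    "Academia – AI/ML",
    "Academia – other STEM",
    "Academia – social science / humanities / policy",
    "Government / policy org",
    "Other industry",
    "Unknown",
]


def flatten_cell(cell):
    """Turn a JSON-list cell into a scalar category, or None for 'no reference'."""
    if cell is None or cell == "":
        return None
    values = json.loads(cell)
    if not values:
        return None
    first = values[0]
    # A None category is treated as 'no reference', per the missing-data rule.
    return None if first in (None, "None") else first


def reference_features(ref1_cell, ref2_cell):
    """Per-category binary flags plus the total count of categorized references."""
    cats = [c for c in (flatten_cell(ref1_cell), flatten_cell(ref2_cell)) if c is not None]
    features = {f"has_{name}": int(name in cats) for name in CATEGORIES}
    features["n_refs"] = len(cats)
    return features


example = reference_features('["AI safety org"]', '["Other industry"]')
print(example["has_AI safety org"], example["has_Other industry"], example["n_refs"])
```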

Key assumptions / caveats.