C2 — Predicting mentor evals

Context

If we want to select for fellows who will perform well, the right benchmark is mentor evaluations during the program — how mentors rate fellows on domain skill, research execution, AI safety knowledge, and mission alignment. Which application-time features actually predict mentor-eval scores?

We run this analysis per cohort (7.0, 8.0, 9.0), since the available application features differ across cohorts. 10.0 has no mentor-eval data yet (program in progress).

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 (summer 2026) is the first cohort with a centralized application review instead of decentralized, stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief: ~2,200 people applied, and applicants moved through up to three stages.

Stage 1: applicants submitted background, experience, and motivation, and picked the research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly did not meet a minimum bar and produced advisory per-stream recommendations.
Stage 2: applicants who passed Stage 1 had their materials scored against LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, and Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3: applicants chose specific mentors ("streams") to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants received offers and lower-ranked applicants were waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
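The weights above can be sketched in a few lines of Python. This is an illustration of the stated formula, not the pipeline's actual code: the function names and the example score values are hypothetical, and it assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied.

```python
# Hypothetical names throughout; only the weights and multipliers come from the text above.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier is applied to RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


if __name__ == "__main__":
    # Made-up sub-scores for one applicant whose experience is "Adjacent".
    score = empirical_composite(rs=8, mle=7, swe=6, math_score=5, ss=9,
                                relevance="Adjacent")
    print(round(score, 3))
```

Because Research Skills carries half the weight, moving an applicant from Direct to Distant removes 0.50 × 0.40 × RS from the composite, which can easily outweigh differences in the other sub-scores.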

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
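As a rough sketch of how these three flags relate, the snippet below derives them from per-stream engagement records. The record layout and field names (StreamEngagement, sent_megastream_takehome, and so on) are hypothetical; only the flag definitions come from the text above.

```python
from dataclasses import dataclass


@dataclass
class StreamEngagement:
    """One stream's interaction with one applicant (hypothetical schema)."""
    invited_to_worktest: bool = False
    invited_to_interview: bool = False
    ranked: bool = False
    sent_megastream_takehome: bool = False
    offered: bool = False
    waitlisted: bool = False


def outcome_flags(engagements):
    """Collapse an applicant's per-stream records into the three outcome flags."""
    is_ranked = any(e.ranked for e in engagements)
    # Any form of engagement counts, including being ranked, so this is a
    # superset of is_ranked by construction.
    is_invited_to_worktest = any(
        e.invited_to_worktest or e.invited_to_interview
        or e.ranked or e.sent_megastream_takehome
        for e in engagements
    )
    passed_mentors_bar = any(e.offered or e.waitlisted for e in engagements)
    return {
        "is_ranked": is_ranked,
        "is_invited_to_worktest": is_invited_to_worktest,
        "passed_mentors_bar": passed_mentors_bar,
    }


if __name__ == "__main__":
    applicant = [
        StreamEngagement(invited_to_worktest=True),
        StreamEngagement(invited_to_interview=True, ranked=True, waitlisted=True),
    ]
    print(outcome_flags(applicant))
```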

Why this differs from C1

C1 focused on one feature (CodeSignal) across cohorts. C2 looks at the full set of application features per cohort. Each cohort's feature set is different — 7.0 had limited features, 8.0/9.0 added centralized review scores. We can't pool across cohorts cleanly.

What mentor-eval composite is

The composite is the mean of four standardized dimensions: domain skill, research execution, AI safety knowledge, and mission alignment. Each fellow has one composite; if a fellow was evaluated more than once, the evaluations are averaged. Scores range roughly from 1 to 10.
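A minimal sketch of that calculation follows. It assumes all four dimensions are already on a common scale, that each evaluation is first averaged across dimensions, and that evaluations are then averaged per fellow; the dictionary keys and the order of averaging are assumptions, not confirmed details of the actual computation.

```python
from statistics import mean

# Hypothetical keys for the four dimensions named above.
DIMENSIONS = ("domain_skill", "research_execution",
              "ai_safety_knowledge", "mission_alignment")


def evaluation_score(evaluation):
    """Mean of the four dimension scores in a single mentor evaluation."""
    return mean(evaluation[d] for d in DIMENSIONS)


def mentor_composite(evaluations):
    """One composite per fellow: the average over all of that fellow's evaluations."""
    return mean(evaluation_score(e) for e in evaluations)


if __name__ == "__main__":
    # Made-up scores for one fellow who was evaluated twice.
    fellow_evals = [
        {"domain_skill": 7, "research_execution": 8,
         "ai_safety_knowledge": 6, "mission_alignment": 7},
        {"domain_skill": 9, "research_execution": 8,
         "ai_safety_knowledge": 8, "mission_alignment": 7},
    ]
    print(mentor_composite(fellow_evals))
```

When every evaluation has all four dimensions, averaging dimensions first or evaluations first gives the same composite; the order only matters if some dimension scores are missing.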

Model fit per cohort

R² is modest in every cohort, ranging from 0.077 (7.0) to 0.339 (8.0). Application features explain at most about a third of the variance in mentor evals. The 8.0 and 9.0 fits sit somewhat above the R² < 0.25 reported in the prior 8.0 validation summary, but the conclusion is the same: selection is hard, and much of in-program performance is determined by post-application factors (mentor fit, project, etc.).
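For readers who want to see what a per-cohort fit of this kind involves, here is a small pure-Python sketch. It assumes an ordinary least squares regression on z-scored features and a z-scored outcome, which yields standardized coefficients and R² together; the actual model specification is not stated here, and the function and feature names are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def standardized_ols(features, outcome):
    """Return ({feature: standardized beta}, R^2) for one cohort.

    features: dict mapping feature name -> list of values (one per fellow).
    outcome:  list of mentor-eval composites in the same order.
    """
    names = list(features)
    zx = [zscore(features[name]) for name in names]
    zy = zscore(outcome)
    k = len(names)
    # Normal equations; no intercept is needed because every column has mean 0.
    xtx = [[sum(p * q for p, q in zip(zx[i], zx[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * y for p, y in zip(zx[i], zy)) for i in range(k)]
    betas = solve(xtx, xty)
    fitted = [sum(betas[j] * zx[j][i] for j in range(k)) for i in range(len(zy))]
    ss_res = sum((y - f) ** 2 for y, f in zip(zy, fitted))
    ss_tot = sum(y ** 2 for y in zy)
    return dict(zip(names, betas)), 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy data only; not drawn from any cohort.
    toy_features = {"feature_a": [1, 2, 3, 4, 5], "feature_b": [2, 1, 4, 3, 5]}
    toy_outcome = [4.0, 5.5, 6.0, 8.5, 8.0]
    betas, r2 = standardized_ols(toy_features, toy_outcome)
    print(betas, r2)
```

With 55 to 67 fellows per cohort, an in-sample R² computed this way tends to overstate out-of-sample predictive power, more so as the number of features grows.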

Standardized coefficients per cohort

Bivariate Spearman ρ (each feature vs. mentor composite)

This view is independent of multicollinearity — useful for spotting features that have signal but get drowned out in the joint regression.
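Spearman's ρ is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below implements that definition in plain Python for one feature against the mentor composite; the function names are illustrative, and in practice a library routine such as scipy.stats.spearmanr would do the same job.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def spearman_rho(feature, composite):
    """Spearman rank correlation between one feature and the mentor composite."""
    return pearson(average_ranks(feature), average_ranks(composite))


if __name__ == "__main__":
    # Toy values only; not drawn from any cohort.
    feature = [610, 720, 720, 800, 540]
    composite = [6.5, 7.0, 8.0, 7.5, 5.0]
    print(spearman_rho(feature, composite))
```

Because it depends only on rank order, ρ is insensitive to monotone rescaling of a feature and to a few extreme values, which is helpful at these sample sizes.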

Model fit per cohort (n and R² of the joint regression):

Cohort	n (in regression)	R²
7.0	57	0.077
8.0	67	0.339
9.0	55	0.264


Takeaways