C2 — What application features predict mentor-eval scores?

Context

If we want to select for fellows who will perform well, the right benchmark is mentor evaluations during the program — how mentors rate fellows on domain skill, research execution, AI safety knowledge, and mission alignment. Which application-time features actually predict mentor-eval scores?

We run this analysis per cohort (7.0, 8.0, 9.0), since the available application features differ across cohorts. 10.0 has no mentor-eval data yet (program in progress).

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0, running in summer 2026, is the first cohort to use centralized application review instead of decentralized, stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
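
As a worked illustration of the formula above, here is a minimal Python sketch. It assumes all sub-scores share one scale (0–10 in the example, which the text does not state), that the relevance multiplier scales Research Skills before the 0.50 weight is applied, and that a single relevance label has already been chosen per applicant. The function and argument names are hypothetical.

```python
# Relevance multipliers applied to Research Skills (values from the text above).
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs, mle, swe, math, ss, relevance="Direct"):
    """Composite = 0.50*RS_adj + 0.35*TE + 0.15*SS.

    Assumption: RS_adj = multiplier * RS, i.e. the multiplier is applied
    to Research Skills only, before weighting.
    """
    rs_adj = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math)
    return 0.50 * rs_adj + 0.35 * te + 0.15 * ss


# Hypothetical applicant with "Adjacent" experience.
score = empirical_composite(rs=8, mle=7, swe=6, math=5, ss=9, relevance="Adjacent")
print(round(score, 3))
```

Because Research Skills carries half the weight, moving an applicant from Direct to Distant removes 0.20·RS from the composite under this reading, which can outweigh the entire Soft Skills term.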

Outcome definitions

Outcome definitions used throughout these analyses:

Why this differs from C1

C1 focused on one feature (CodeSignal) across cohorts. C2 looks at the full set of application features per cohort. Each cohort's feature set is different — 7.0 had limited features, 8.0/9.0 added centralized review scores. We can't pool across cohorts cleanly.

What mentor-eval composite is

Mean of four standardized dimensions: domain skill, research execution, AI safety knowledge, and mission alignment. Each fellow has one composite; when a fellow has more than one evaluation, the per-evaluation composites are averaged. Range is roughly 1–10.
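
The composite can be sketched as follows. This is a minimal illustration, not the actual pipeline: the field names are hypothetical, and the four dimension scores are assumed to already be on a common scale (the standardization step itself is not shown).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical column names for the four mentor-eval dimensions.
DIMENSIONS = (
    "domain_skill",
    "research_execution",
    "ai_safety_knowledge",
    "mission_alignment",
)


def mentor_composites(evals):
    """Return {fellow_id: composite}.

    Each evaluation's composite is the mean of its four dimensions;
    a fellow with several evaluations gets the mean of those composites.
    """
    per_fellow = defaultdict(list)
    for row in evals:
        per_fellow[row["fellow_id"]].append(mean(row[d] for d in DIMENSIONS))
    return {fid: mean(scores) for fid, scores in per_fellow.items()}


# Hypothetical data: fellow "a" was evaluated twice, fellow "b" once.
evals = [
    {"fellow_id": "a", "domain_skill": 8, "research_execution": 6,
     "ai_safety_knowledge": 7, "mission_alignment": 7},
    {"fellow_id": "a", "domain_skill": 6, "research_execution": 6,
     "ai_safety_knowledge": 6, "mission_alignment": 6},
    {"fellow_id": "b", "domain_skill": 9, "research_execution": 9,
     "ai_safety_knowledge": 8, "mission_alignment": 10},
]
composites = mentor_composites(evals)
```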

Model fit per cohort

Cohort   n (in regression)   R²
7.0      57                  0.077
8.0      67                  0.339
9.0      55                  0.264

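R² is modest in every cohort: application features explain roughly 8% (7.0), 34% (8.0), and 26% (9.0) of the variance in mentor evals. The 8.0 and 9.0 values sit somewhat above the R² < 0.25 reported in the prior 8.0 validation summary, but with n between 55 and 67 these in-sample fits are likely optimistic. Selection is hard, and a lot of in-program performance is determined by post-application factors (mentor fit, project, etc.).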
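
In-sample R² at these sample sizes tends to overstate fit, and the standard adjusted-R² correction gives a rough sense of by how much. The sketch below applies it to the per-cohort n and R² values above. The predictor count `K_ASSUMED` is a hypothetical placeholder, since the exact number of regressors per cohort is not stated here.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# (n, R^2) per cohort, taken from the model-fit figures above.
FITS = {"7.0": (57, 0.077), "8.0": (67, 0.339), "9.0": (55, 0.264)}

K_ASSUMED = 7  # hypothetical number of predictors, not taken from the source

for cohort, (n, r2) in FITS.items():
    print(cohort, round(adjusted_r2(r2, n, K_ASSUMED), 3))
```

Under this assumed k, the adjusted value for 7.0 falls below zero, which is the usual sign that the model does no better than predicting the cohort mean.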

Standardized coefficients per cohort

Bivariate Spearman ρ (each feature vs. mentor composite)

This view is independent of multicollinearity — useful for spotting features that have signal but get drowned out in the joint regression.
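
For reference, Spearman ρ is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. The following is a dependency-free sketch with hypothetical function names; it is not necessarily how the figures in this analysis were produced.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(feature, composite):
    """Spearman rho between one feature and the mentor composite."""
    return pearson(average_ranks(feature), average_ranks(composite))
```

Because only ranks enter the calculation, any monotone rescaling of a feature (for example, a log transform of a score) leaves ρ unchanged.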

Takeaways

  1. Application features explain only a minority of mentor-eval variance (R² between 0.08 and 0.34 across cohorts). This is broadly in line with the prior 8.0 finding. Selection from applications is fundamentally hard.
  2. No application-time feature is consistently a strong predictor across all three cohorts. Patterns differ: for example, the centralized "Research independence" score looks meaningful in 8.0/9.0 but was not collected for 7.0.
  3. CodeSignal's coefficient is small or negative across cohorts — same conclusion as C1 from a different angle.
  4. For 11.0: the rubric will produce noisy predictions of in-program performance no matter what we do. The best we can hope for is correctly identifying the bottom of the distribution (people unlikely to succeed) — fine-grained ranking within the top quintile is largely guesswork given how much in-program performance depends on factors outside the application.

Debug: how the data was interpreted

Sample. Per cohort: completed applications joined to mentor-eval rows (mean composite per person if multi-evaluated). Listwise-complete on the cohort's feature set: 7.0 n=57, 8.0 n=67, 9.0 n=55.

Outcome variable(s). Mean of standardized mentor-eval dimensions per fellow (domain skill, research execution, AI safety knowledge, mission alignment).

Predictor fields. All cohorts: CodeSignal score, ordinal education level, and number of background-review tier items (Familiar/Applied/Expert). 8.0 and 9.0 additionally include centralized review scores (research independence, publication record, technical execution, AI safety motivation).

Filters applied. Completed applications + cohort-specific listwise-complete on features.

Missing-data handling. Listwise deletion. The effective n for each regression is small (55–67 per cohort), so CIs were not computed for individual coefficients.
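
The listwise-complete filter described above can be sketched as follows. The column names are hypothetical, and missing values are assumed to be represented as `None`.

```python
def listwise_complete(rows, features, outcome="mentor_composite"):
    """Keep only rows where the outcome and every listed feature are present.

    A row is dropped if any required column is absent or None.
    """
    needed = list(features) + [outcome]
    return [r for r in rows if all(r.get(col) is not None for col in needed)]


# Hypothetical joined rows (application features + mentor composite).
rows = [
    {"codesignal": 820, "education": 3, "mentor_composite": 7.5},
    {"codesignal": None, "education": 2, "mentor_composite": 6.0},  # dropped
    {"codesignal": 790, "education": 4},                            # dropped
    {"codesignal": 760, "education": 1, "mentor_composite": 5.5},
]
kept = listwise_complete(rows, features=["codesignal", "education"])
print(len(kept), "of", len(rows), "rows retained")
```

Since the feature list differs by cohort, the same rows can survive for one cohort's feature set and be dropped for another's, which is one reason the per-cohort n values are not directly comparable.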

Key assumptions / caveats.