D5 — Are mentor evaluations measuring one thing or four?

Context

Mentor evaluations score fellows on four dimensions: domain skill, research execution, AI safety knowledge, and mission alignment. Are these dimensions measuring four distinct things, or are they all reflecting a single 'overall quality' factor?

If PC1 explains the bulk of the variance, the rubric is essentially measuring one thing despite the 4-dim structure — a single composite score would suffice. If multiple components share the variance, the dimensions really do capture distinct constructs.
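As a minimal sketch of the quantity in question, the share of variance per component can be read off the eigenvalues of the correlation matrix. The function name and the synthetic data below are illustrative assumptions, not the analysis code or MATS data.

```python
import numpy as np

def pca_variance_shares(scores: np.ndarray) -> np.ndarray:
    """Return the share of variance explained by each PC, largest first.

    `scores` is an (n_fellows, n_dimensions) array. PCA is run on the
    correlation matrix, which is equivalent to standardizing each
    dimension before the decomposition.
    """
    corr = np.corrcoef(scores, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

# Synthetic illustration only (assumed data): one shared "quality" factor
# plus independent noise on each of four dimensions.
rng = np.random.default_rng(0)
quality = rng.normal(size=(200, 1))
synthetic_scores = quality + rng.normal(size=(200, 4))

shares = pca_variance_shares(synthetic_scores)
print(np.round(shares, 2))
```

If the four dimensions were uncorrelated, each component would explain about 25%; a PC1 share well above that is what indicates a shared factor.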

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
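The formula above can be sketched as follows. This assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied, and that all sub-scores share one scale; the function names and the example scores are hypothetical.

```python
# Relevance multipliers as stated in the text.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}

def technical_execution(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score

def empirical_composite(rs: float, mle: float, swe: float,
                        math_score: float, ss: float, relevance: str) -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with RS scaled by relevance.

    Assumption: the multiplier is applied to RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss

# Hypothetical applicant on an assumed 0-10 scale.
example = empirical_composite(rs=8, mle=7, swe=6, math_score=5, ss=9,
                              relevance="Adjacent")
print(round(example, 3))
```

Because the multiplier touches only Research Skills, moving from Direct to Distant removes at most 40% of the RS term, which is 20% of the maximum composite.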

Outcome definitions

Outcome definitions used throughout these analyses:

Headline

A single 'overall quality' factor (PC1) explains the majority of variance in mentor evaluations: 62% (cohort 6.0), 60% (7.0), 66% (8.0), and 69% (9.0). The four dimensions are highly correlated; mentors largely rate fellows on a single underlying axis of quality.

Variance explained per PC

Cohort   PC1   PC2   PC3   PC4
6.0      62%   21%   11%   6%
7.0      60%   23%   9%    7%
8.0      66%   22%   7%    5%
9.0      69%   21%   6%    4%

PC1 dominates. PC2 captures secondary variation; PC3 and PC4 are essentially noise.
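For intuition, a PC1 share can be translated into a rough average pairwise correlation. The sketch below assumes, purely for illustration, that all six pairwise correlations among the four standardized dimensions are equal to some r; the top eigenvalue is then 1 + 3r and the PC1 share is (1 + 3r) / 4. The real correlation matrices are not exactly of this form, so the output is only a back-of-envelope figure.

```python
def implied_mean_correlation(pc1_share: float, n_dims: int = 4) -> float:
    """Invert share = (1 + (k - 1) * r) / k for r.

    Assumes an equicorrelation structure (every pair of dimensions has
    the same correlation r), which is a simplification.
    """
    return (n_dims * pc1_share - 1) / (n_dims - 1)

# PC1 shares taken from the table above.
pc1_share_by_cohort = {"6.0": 0.62, "7.0": 0.60, "8.0": 0.66, "9.0": 0.69}

implied = {cohort: round(implied_mean_correlation(share), 2)
           for cohort, share in pc1_share_by_cohort.items()}
print(implied)
# → {'6.0': 0.49, '7.0': 0.47, '8.0': 0.55, '9.0': 0.59}
```

Under this simplification, the reported PC1 shares correspond to an average pairwise correlation of roughly 0.47 to 0.59.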

PC1 loadings per dimension

Cohort   Domain skill   Research exec   AI safety know   Mission align
6.0      +0.51          +0.52           +0.48            +0.49
7.0      +0.53          +0.50           +0.52            +0.45
8.0      +0.52          +0.51           +0.52            +0.44
9.0      +0.51          +0.50           +0.52            +0.47

All four dimensions load positively and roughly evenly on PC1 (0.44 to 0.53, against 0.50 each for a perfectly even split across four dimensions), confirming that it is an 'overall positive evaluation' factor rather than one specific to any dimension. The differences between dimensions are small: mission alignment has the lowest loading in cohorts 7.0–9.0, while in 6.0 all four loadings fall within 0.04 of each other.
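One way to quantify "roughly evenly" is the cosine similarity between each cohort's PC1 loading vector and the equal-weight direction (0.5, 0.5, 0.5, 0.5). The sketch below uses the rounded loadings from the table; the helper name is an illustrative assumption.

```python
import math

# PC1 loadings copied from the table (rounded to two decimals).
# Column order: domain skill, research exec, AI safety know, mission align.
pc1_loadings = {
    "6.0": [0.51, 0.52, 0.48, 0.49],
    "7.0": [0.53, 0.50, 0.52, 0.45],
    "8.0": [0.52, 0.51, 0.52, 0.44],
    "9.0": [0.51, 0.50, 0.52, 0.47],
}

def cosine_with_equal_weights(loadings: list[float]) -> float:
    """Cosine similarity between a loading vector and the equal-weight vector."""
    k = len(loadings)
    equal = [1 / math.sqrt(k)] * k  # unit-length equal-weight direction
    dot = sum(a * b for a, b in zip(loadings, equal))
    norm = math.sqrt(sum(a * a for a in loadings))
    return dot / norm

similarity = {cohort: cosine_with_equal_weights(v)
              for cohort, v in pc1_loadings.items()}
for cohort, value in similarity.items():
    print(cohort, round(value, 4))
```

All four cohorts come out above 0.99, so a simple unweighted average of the four standardized dimensions points in nearly the same direction as PC1.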

Cross-dim correlations per cohort

Heatmaps of the cross-dimension correlations are in plots/, one per cohort (6.0, 7.0, 8.0, 9.0).

Takeaways

  1. Mentor evaluations primarily measure a single 'overall quality' axis. PC1 explains 60–69% of variance across cohorts; the four sub-dimensions are highly inter-correlated.
  2. The "halo effect" is real and consistent across cohorts. A mentor's evaluation of Domain skill moves with their evaluation of Mission alignment, Research execution, and AI safety knowledge all together.
  3. This simplifies the "what to predict" question in C2/C3. Predicting the composite is essentially the same as predicting any individual dimension.
  4. For 11.0 mentor-eval design: keep the 4-dim structure for ergonomics (mentors are used to it), but understand that the composite is what reliably captures the signal. Don't over-engineer dimension-specific predictive rubrics — they'd all converge.
  5. Sub-factor (PC2 and beyond) interpretation is fragile: PC2 explains only 21–23% of variance, PC3 and PC4 together 10–17%, and the samples are small (76–106 fellows per cohort). Don't over-read what PC2 represents.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Each cohort's mentor-eval matrix on the 4 standardized dimensions, one row per fellow (fellows with multiple evaluations are averaged). n = 95 (6.0), 76 (7.0), 106 (8.0), 93 (9.0).

Outcome variable(s). N/A — exploratory PCA / factor structure.

Predictor fields. The 4 dimensions themselves (no external predictor).

Filters applied. Per-cohort listwise complete on all 4 dimensions.

Missing-data handling. Listwise drop.
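A minimal sketch of the row construction described above: average multiple evaluations per fellow, then drop any fellow missing a dimension. The field names, the record layout, and the order of operations (average first, then drop) are assumptions; the source does not specify them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for the four rubric dimensions.
DIMENSIONS = ("domain_skill", "research_execution",
              "ai_safety_knowledge", "mission_alignment")

def build_fellow_matrix(evaluations: list[dict]) -> dict[str, list[float]]:
    """Return {fellow_id: [one averaged score per dimension]}.

    Each evaluation is a dict with 'fellow_id' and the four dimension keys;
    a missing score is None. Fellows lacking any dimension after averaging
    are dropped (listwise deletion).
    """
    by_fellow = defaultdict(list)
    for evaluation in evaluations:
        by_fellow[evaluation["fellow_id"]].append(evaluation)

    matrix = {}
    for fellow_id, fellow_evals in by_fellow.items():
        row = []
        for dim in DIMENSIONS:
            values = [e.get(dim) for e in fellow_evals if e.get(dim) is not None]
            row.append(mean(values) if values else None)
        if all(v is not None for v in row):  # listwise complete on all 4
            matrix[fellow_id] = row
    return matrix

# Made-up records: fellow "a" has two evaluations, "b" is missing one score.
sample_evaluations = [
    {"fellow_id": "a", "domain_skill": 4, "research_execution": 3,
     "ai_safety_knowledge": 5, "mission_alignment": 4},
    {"fellow_id": "a", "domain_skill": 2, "research_execution": 5,
     "ai_safety_knowledge": 3, "mission_alignment": 4},
    {"fellow_id": "b", "domain_skill": 3, "research_execution": None,
     "ai_safety_knowledge": 4, "mission_alignment": 5},
]
fellow_matrix = build_fellow_matrix(sample_evaluations)
print(fellow_matrix)
```

In this example fellow "a" is kept with averaged scores and fellow "b" is dropped. Standardizing each dimension within the cohort would follow as a separate step.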

Key assumptions / caveats.