Mentor evaluations score fellows on four dimensions: domain skill, research execution, AI safety knowledge, and mission alignment. Are these dimensions measuring four distinct things, or are they all reflecting a single 'overall quality' factor?
If PC1 explains the bulk of the variance, the rubric is essentially measuring one thing despite its four-dimension structure, and a single composite score would suffice. If several components each explain a substantial share of the variance, the dimensions really do capture distinct constructs.
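As a rough illustration of the test described above, the sketch below computes the share of variance per principal component from the correlation matrix of a ratings matrix. It is a minimal sketch, not the analysis code: the function name `explained_variance_ratio` and the synthetic data are assumptions made for illustration, and the data is generated so that one shared factor drives all four ratings.

```python
import numpy as np


def explained_variance_ratio(scores: np.ndarray) -> np.ndarray:
    """Share of total variance per principal component, largest first.

    `scores` is an (n_fellows, n_dimensions) array with no missing values.
    """
    # Standardize each dimension so PCA runs on the correlation matrix.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    corr = np.cov(z, rowvar=False)  # covariance of z-scores = correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh sorts ascending
    return eigenvalues / eigenvalues.sum()


# Synthetic illustration only (not MATS data): four ratings driven by one
# shared "quality" factor plus independent noise per dimension.
rng = np.random.default_rng(0)
quality = rng.normal(size=(200, 1))
ratings = quality + 0.8 * rng.normal(size=(200, 4))

ratios = explained_variance_ratio(ratings)
print(np.round(ratios, 2))
```

When one factor drives all four ratings, as in this synthetic setup, the first ratio is well above one half; four independent ratings would instead give ratios near 0.25 each.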
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
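The composite can be sketched in a few lines. This is an illustration under stated assumptions, not the scoring code: the function names are hypothetical, the input values are made up, and the sketch assumes the relevance multiplier scales RS before the 0.50 weight is applied, which the text does not spell out.

```python
# Multipliers as stated in the text.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_score(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS for the empirical track."""
    # Assumption: the multiplier is applied to RS before weighting.
    adjusted_rs = RELEVANCE_MULTIPLIER[relevance] * rs
    te = technical_score(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Made-up inputs on an arbitrary scale, with "Adjacent" relevance.
score = empirical_composite(
    rs=4.0, mle=4.0, swe=3.0, math_score=5.0, ss=3.0, relevance="Adjacent"
)
print(round(score, 3))  # 0.50*3.4 + 0.35*3.9 + 0.15*3.0 = 3.515
```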
Outcome definitions used throughout these analyses:
- is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
- is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
- passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

A single 'overall quality' factor (PC1) explains the majority of variance in mentor evaluations: 62% (6.0), 60% (7.0), 66% (8.0), 69% (9.0). The 4 dimensions are highly correlated; mentors largely rate fellows on a single underlying axis of quality.
| Cohort | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| 6.0 | 62% | 21% | 11% | 6% |
| 7.0 | 60% | 23% | 9% | 7% |
| 8.0 | 66% | 22% | 7% | 5% |
| 9.0 | 69% | 21% | 6% | 4% |
PC1 dominates in every cohort (60% to 69%). PC2 captures a consistent secondary share (21% to 23%); PC3 and PC4 together account for 10% to 17% and are essentially noise.
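To make the split between components concrete, the reported shares can be accumulated per cohort. The numbers below are copied from the explained-variance table; the variable names are illustrative.

```python
from itertools import accumulate

# Explained variance (%) for PC1..PC4, copied from the table.
reported = {
    "6.0": [62, 21, 11, 6],
    "7.0": [60, 23, 9, 7],
    "8.0": [66, 22, 7, 5],
    "9.0": [69, 21, 6, 4],
}

# Running total of variance explained by the first k components.
cumulative = {cohort: list(accumulate(shares)) for cohort, shares in reported.items()}

for cohort, totals in cumulative.items():
    print(cohort, totals)
```

The first two components together cover 83% to 90% of the variance in every cohort. The 7.0 row totals 99% rather than 100%, which is consistent with rounding in the reported figures.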
| Cohort | Domain skill | Research execution | AI safety knowledge | Mission alignment |
|---|---|---|---|---|
| 6.0 | +0.51 | +0.52 | +0.48 | +0.49 |
| 7.0 | +0.53 | +0.50 | +0.52 | +0.45 |
| 8.0 | +0.52 | +0.51 | +0.52 | +0.44 |
| 9.0 | +0.51 | +0.50 | +0.52 | +0.47 |
All four dimensions load positively and roughly evenly on PC1 (+0.44 to +0.53), confirming it's an 'overall positive evaluation' factor, not specific to any dimension. The differences between dimensions are small: mission alignment has the lowest loading in 7.0, 8.0, and 9.0 (+0.44 to +0.47), while in 6.0 the four loadings are nearly identical (+0.48 to +0.52).
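One way to read "roughly evenly" is to compare the loadings with the perfectly even case. The sketch below assumes the table reports entries of a unit-length PC1 eigenvector (an assumption, though the sums of squares are consistent with it), in which case four equal loadings would each be 0.50. Values are copied from the table; the variable names are illustrative.

```python
import math

# PC1 loadings copied from the table, in the order: domain skill,
# research execution, AI safety knowledge, mission alignment.
pc1_loadings = {
    "6.0": [0.51, 0.52, 0.48, 0.49],
    "7.0": [0.53, 0.50, 0.52, 0.45],
    "8.0": [0.52, 0.51, 0.52, 0.44],
    "9.0": [0.51, 0.50, 0.52, 0.47],
}

checks = {}
for cohort, weights in pc1_loadings.items():
    norm = math.sqrt(sum(w * w for w in weights))  # about 1 for a unit eigenvector
    spread = max(weights) - min(weights)           # 0 would mean perfectly even
    checks[cohort] = (norm, spread)
    print(f"{cohort}: norm={norm:.3f}, spread={spread:.2f}")
```

Under that assumption, every cohort's loading vector has length within 0.01 of 1, and the gap between the largest and smallest loading stays below 0.10.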
Heatmaps in plots/:
- 6.0, 7.0, 8.0, 9.0
Sample. Each cohort's mentor-eval matrix on the 4 standardized dimensions, with one row per fellow (scores averaged for fellows with multiple evaluations). n: 95 (6.0), 76 (7.0), 106 (8.0), 93 (9.0).
Outcome variable(s). N/A — exploratory PCA / factor structure.
Predictor fields. The 4 dimensions themselves (no external predictor).
Filters applied. Per-cohort listwise complete on all 4 dimensions.
Missing-data handling. Listwise drop.
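The sample construction described above (one row per fellow, averaging multiple evaluations, then a listwise drop) might look like the following. This is a sketch under assumptions: the field names and the record layout are hypothetical, and the order of operations (average the available ratings first, then drop fellows still missing any dimension) is one plausible reading, not something the text confirms.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for the four rubric dimensions.
DIMENSIONS = (
    "domain_skill",
    "research_execution",
    "ai_safety_knowledge",
    "mission_alignment",
)


def per_fellow_matrix(evaluations):
    """Return {fellow_id: [four averaged scores]} for complete cases only."""
    by_fellow = defaultdict(list)
    for evaluation in evaluations:
        by_fellow[evaluation["fellow_id"]].append(evaluation)

    rows = {}
    for fellow_id, evals in by_fellow.items():
        row = []
        for dim in DIMENSIONS:
            # Average whatever ratings exist for this dimension.
            values = [e.get(dim) for e in evals if e.get(dim) is not None]
            row.append(mean(values) if values else None)
        # Listwise drop: keep the fellow only if all four dimensions are present.
        if all(value is not None for value in row):
            rows[fellow_id] = row
    return rows


# Made-up records: fellow "a" has two evaluations, fellow "b" lacks one dimension.
evaluations = [
    {"fellow_id": "a", "domain_skill": 4, "research_execution": 3,
     "ai_safety_knowledge": 5, "mission_alignment": 4},
    {"fellow_id": "a", "domain_skill": 2, "research_execution": 5,
     "ai_safety_knowledge": 3, "mission_alignment": 4},
    {"fellow_id": "b", "domain_skill": 3, "research_execution": None,
     "ai_safety_knowledge": 4, "mission_alignment": 4},
]
rows = per_fellow_matrix(evaluations)
print(rows)
```

In this made-up example fellow "a" is kept with averaged scores and fellow "b" is dropped for the missing research-execution rating.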
Key assumptions / caveats.