Part C: Cross-cohort validation & discovery (6.0–10.0)

Context

Parts A and B looked at 10.0 alone. Part C uses the multi-cohort dataset (6.0 through 10.0, plus alumni outcome data) to validate findings, test for trends, and ask discovery questions we can't answer from 10.0 alone. It comprises eight cross-cohort analyses (C1–C8).

Cohort timeline

6.0 (summer 2024) — fully decentralized stream review. Different rubric than later cohorts.
7.0 (autumn 2024) — fully decentralized. CodeSignal added.
8.0 (summer 2025) — fully decentralized. The "CodeSignal paradox" was first documented here.
9.0 (autumn 2025) — partial centralized review on top of decentralized stream review. Added centralized review scores (research independence, publication record, technical execution, AI safety motivation).
10.0 (summer 2026, in progress) — fully centralized review at Stages 1 & 2. New composite-score rubric. Mentor evals don't exist yet for 10.0.

Headline findings

The "CodeSignal paradox" replicates across cohorts 7.0, 8.0, and 9.0. CodeSignal predicts who passes the bar (AUC 0.70–0.78) but does NOT predict mentor evaluations of in-program performance (ρ ≈ 0 across cohorts). Three cohorts of evidence. C1 results.
Application features explain only a minority of mentor-eval variance. Across 7.0/8.0/9.0, application-time predictors reach R² = 0.08 (7.0), 0.34 (8.0), and 0.26 (9.0), about 0.23 on average and never more than roughly a third. Selection from applications is inherently noisy. C2 results.
Centralized "publication record" review carries consistent signal for predicting both mentor evals (C2) and post-program publications (C3, in-sample AUCs 0.72–0.78 across cohorts). The clearest "keep this signal" finding. C3 results.
Mentor-eval distributions are remarkably stable across 6.0/7.0/8.0/9.0. Mean composite hovers at 7.2–7.5/10; fraction "high quality" (≥8) sits at 25–35% across cohorts. The shift from decentralized to partially-centralized in 9.0 didn't produce a visible quality bump. C4 results.
The simplest version of 10.0's LLM tier-sort is NOT as predictive as 9.0's centralized review when applied retrospectively (R² ~0.05 vs ~0.17). But the comparison is unfair — only tier counts (not the full attribute-score aggregation) were re-run on 8.0/9.0. A useful follow-up: re-run the full 10.0 rubric on past resumes. C5 results.
Demographic outcomes are broadly comparable across cohorts. No dramatic shift in gender × outcome rates with the 10.0 centralization. Pool composition is slowly diversifying. Race data limited by small subgroup n and missingness. C6 results.
Pool size has grown about 2.5× (878 in 7.0 → 2,210 in 10.0) with modest education-composition shifts (slightly more Bachelor's-only applicants over time). C7 results.
Alumni survey (n=123, opt-in): most respondents work at AI safety orgs; publications and post-MATS funding are common. Most-attributed MATS-impact mechanisms are connections and resume strengthening — not direct skill development. C8 results.
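The two statistics behind the C1 "paradox" can be illustrated with a minimal stdlib-only sketch: AUC for "does CodeSignal separate applicants who pass the bar from those who don't" and Spearman ρ for "does CodeSignal track mentor evals among admitted fellows". The data below are made-up toy values, and the variable names are hypothetical; this is not the code used for the C1 report.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) identity; labels are 0/1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy applicant pool (made-up): CodeSignal score and whether the applicant passed the bar.
codesignal = [300, 420, 510, 560, 600, 640, 700, 780]
passed_bar = [0, 0, 0, 1, 0, 1, 1, 1]

# Toy admitted fellows (made-up): CodeSignal score and mentor-eval composite.
fellow_codesignal = [560, 640, 700, 780]
fellow_mentor_eval = [7.0, 8.0, 8.0, 7.0]

admission_auc = auc(codesignal, passed_bar)
performance_rho = spearman_rho(fellow_codesignal, fellow_mentor_eval)
print(f"AUC (passes bar): {admission_auc:.3f}")
print(f"Spearman rho (mentor eval): {performance_rho:.3f}")
```

In this toy pool the score separates pass from fail well while carrying no rank information about mentor evals, which is the same qualitative pattern the C1 finding describes.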

11.0 implications (tentative)

Strong evidence against CodeSignal as a selection criterion. Three cohorts agree it predicts admission but not performance (C1). The current 11.0 proposal to drop or de-emphasize it is well-supported.
Keep the "publication record" rubric prompt. Its consistent signal across cohorts (C2/C3) is one of the most reliable findings.
Don't over-promise quality improvements from changing the selection process. 6.0–9.0 mentor-eval distributions look essentially the same (C4), which suggests that applicant-pool quality may matter more than rubric details for the average fellow's evaluation.
Run the full 10.0 rubric on 8.0/9.0 resumes (not just tier sorting) to enable a fair retrospective C5-style comparison.
Continue tracking demographics and pool composition cohort-over-cohort (C6/C7).
Continue the alumni survey, with awareness of selection bias (C8).

Individual reports

Analysis	Question	Sample (n)
C1 — CodeSignal paradox across cohorts	Does the 8.0 paradox replicate?	7.0/8.0/9.0 (CS + mentor evals)
C2 — What predicts mentor evals?	Application features → mentor evals, per cohort	7.0: n=37, 8.0: n=58, 9.0: n=55
C3 — What predicts publications?	Application features → post-program pubs	6.0–9.0 alumni-pub joins (n ~50–80 each)
C4 — Quality trend across cohorts	Did selection quality improve over 6.0 → 9.0?	76–106 mentor evals per cohort
C5 — 10.0 features retrospective	Would 10.0 features have predicted 8.0/9.0 performance?	8.0/9.0 joins, n ~60 each
C6 — Demographics across cohorts	Did 10.0's centralized process change demographic outcomes?	7.0–10.0, varies by group
C7 — Pool composition shifts	How has the applicant pool changed?	6.0–10.0, full pools
C8 — Alumni survey outcomes	Where do alumni go? What mechanism do they attribute MATS impact to?	123 survey respondents
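Because the C2 samples in the table are small (n = 37–58), it can help to see how much a raw R² shrinks once the number of predictors is accounted for. The sketch below applies the standard adjusted-R² formula to the R² values quoted in the headline findings. It assumes those figures are ordinary in-sample R² values, and the predictor count `P_ASSUMED` is a hypothetical placeholder, since the report does not state how many features each model used.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R^2 and n per cohort as quoted in this report (C2).
reported = {"7.0": (0.08, 37), "8.0": (0.34, 58), "9.0": (0.26, 55)}

# Hypothetical number of predictors; NOT taken from the report.
P_ASSUMED = 5

adjusted = {
    cohort: adjusted_r2(r2, n, P_ASSUMED)
    for cohort, (r2, n) in reported.items()
}

for cohort, value in adjusted.items():
    r2, n = reported[cohort]
    print(f"{cohort}: raw R2 = {r2:.2f}, n = {n}, adjusted R2 = {value:.3f}")
```

Under this assumed predictor count every adjusted value is lower than its raw counterpart, and the gap is largest for the smallest cohort, so the cross-cohort differences in R² should be read cautiously.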

Errors encountered during Part C

None unrecovered. One in-flight fix: C3's initial join was too strict. Listwise deletion (dropping any row with a missing value in any feature) removed most rows for 6.0/7.0/8.0. Switching to mean imputation, with a cohort-specific lookup for the CodeSignal column, brought all four cohorts (6.0–9.0) into the model.
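The difference between the two approaches can be sketched as follows. This is an illustrative stdlib-only example, not the C3 pipeline: the rows are made-up, the column names in `CS_COLUMN` are hypothetical, and imputing with the overall (rather than per-cohort) mean is an assumption.

```python
from statistics import mean

# Hypothetical raw CodeSignal column name per cohort; 6.0 has none
# because CodeSignal was added in 7.0.
CS_COLUMN = {"7.0": "codesignal_score", "8.0": "cs_general", "9.0": "cs_general"}
FEATURES = ["codesignal", "pub_record"]

# Made-up rows; None or an absent key means the value is missing.
raw = [
    {"cohort": "6.0", "pub_record": 3.0},
    {"cohort": "7.0", "pub_record": 4.0, "codesignal_score": 520},
    {"cohort": "8.0", "pub_record": None, "cs_general": 610},
    {"cohort": "9.0", "pub_record": 2.0, "cs_general": 580},
]

def harmonize(row):
    """Map the cohort-specific CodeSignal column onto one shared name."""
    col = CS_COLUMN.get(row["cohort"])
    return {
        "cohort": row["cohort"],
        "pub_record": row.get("pub_record"),
        "codesignal": row.get(col) if col else None,
    }

def listwise_drop(rows, features):
    """Keep only rows with no missing value in any feature."""
    return [r for r in rows if all(r[f] is not None for f in features)]

def mean_impute(rows, features):
    """Replace each missing value with the mean of the observed values."""
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in features}
    return [
        {**r, **{f: means[f] if r[f] is None else r[f] for f in features}}
        for r in rows
    ]

rows = [harmonize(r) for r in raw]
complete = listwise_drop(rows, FEATURES)
imputed = mean_impute(rows, FEATURES)

print("cohorts kept by listwise deletion:", [r["cohort"] for r in complete])
print("cohorts kept by mean imputation:", [r["cohort"] for r in imputed])
```

Mean imputation keeps every row, but it also shrinks the variance of the imputed features, so coefficients on heavily imputed columns (CodeSignal for 6.0 in particular) are best interpreted with care.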