Part C — Cross-cohort validation & discovery (7.0–10.0)

Context

Parts A and B looked at 10.0 alone. Part C uses the multi-cohort dataset (6.0 through 10.0, plus alumni outcome data) to validate those findings, test for trends, and ask discovery questions that 10.0 alone cannot answer. It comprises eight analyses (C1–C8).

Headline findings

  1. The "CodeSignal paradox" replicates across cohorts 7.0, 8.0, and 9.0. CodeSignal predicts who passes the bar (AUC 0.70–0.78) but does NOT predict mentor evaluations of in-program performance (ρ ≈ 0 across cohorts). Three cohorts of evidence. C1 results.
  2. Application features explain at most about a third of mentor-eval variance. Across 7.0/8.0/9.0, application-time predictors reach R² = 0.08 (7.0), 0.34 (8.0), and 0.26 (9.0). Selection from applications is fundamentally noisy. C2 results.
  3. Centralized "publication record" review carries consistent signal for predicting both mentor evals (C2) and post-program publications (C3, in-sample AUCs 0.72–0.78 across cohorts). The clearest "keep this signal" finding. C3 results.
  4. Mentor-eval distributions are remarkably stable across 6.0/7.0/8.0/9.0. Mean composite hovers at 7.2–7.5/10; fraction "high quality" (≥8) sits at 25–35% across cohorts. The shift from decentralized to partially-centralized in 9.0 didn't produce a visible quality bump. C4 results.
  5. The simplest version of 10.0's LLM tier-sort is NOT as predictive as 9.0's centralized review when applied retrospectively (R² ~0.05 vs ~0.17). But the comparison is unfair — only tier counts (not the full attribute-score aggregation) were re-run on 8.0/9.0. A useful follow-up: re-run the full 10.0 rubric on past resumes. C5 results.
  6. Demographic outcomes are broadly comparable across cohorts. No dramatic shift in gender × outcome rates with the 10.0 centralization. Pool composition is slowly diversifying. Race data limited by small subgroup n and missingness. C6 results.
  7. Pool size has grown about 2.5× (878 in 7.0 → 2,210 in 10.0), with modest education-composition shifts (slightly more Bachelor's-only applicants over time). C7 results.
  8. Alumni survey (n=123, opt-in): most respondents work at AI safety orgs; publications and post-MATS funding are common. Most-attributed MATS-impact mechanisms are connections and resume strengthening — not direct skill development. C8 results.
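
For readers less familiar with the two statistics in finding 1, here is a minimal sketch of how an AUC and a Spearman ρ can be computed from scratch. It is not the code used in C1. The function names and every number in the toy data are hypothetical, chosen only so that the same score separates a pass/fail label well while having no rank relationship with a second rating.

```python
from itertools import product
from statistics import mean


def auc(scores, labels):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy data, invented for illustration only (not drawn from any cohort).
toy_scores = [300, 450, 500, 620, 700, 810, 900]
toy_passed = [0, 0, 1, 0, 1, 1, 1]
toy_admitted_scores = [500, 700, 810, 900]
toy_mentor_evals = [6.5, 8.0, 6.0, 7.5]

print("AUC:", round(auc(toy_scores, toy_passed), 3))
print("rho:", round(spearman(toy_admitted_scores, toy_mentor_evals), 3))
```

Note that ρ here is computed only on the rows that have a mentor eval, which is a narrower score range than the full pool used for the AUC. Whether that narrowing accounts for any of the real pattern is a question for the C1 report, not something this sketch can answer.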

11.0 implications (tentative)

Individual reports

| Analysis | Question | n |
|---|---|---|
| C1: CodeSignal paradox across cohorts | Does the 8.0 paradox replicate? | 7.0/8.0/9.0 (CS + mentor evals) |
| C2: What predicts mentor evals? | Application features → mentor evals, per cohort | 7.0: n=37, 8.0: n=58, 9.0: n=55 |
| C3: What predicts publications? | Application features → post-program pubs | 6.0–9.0 alumni-pub joins (n ~50–80 each) |
| C4: Quality trend across cohorts | Did selection quality improve over 6.0 → 9.0? | 76–106 mentor evals per cohort |
| C5: 10.0 features retrospective | Would 10.0 features have predicted 8.0/9.0 performance? | 8.0/9.0 joins, n ~60 each |
| C6: Demographics across cohorts | Did 10.0's centralized process change demographic outcomes? | 7.0–10.0, varies by group |
| C7: Pool composition shifts | How has the applicant pool changed? | 6.0–10.0, full pools |
| C8: Alumni survey outcomes | Where do alumni go? What mechanism do they attribute MATS impact to? | 123 survey respondents |

Errors encountered during Part C

None unrecovered. One in-flight fix: C3's initial join was too strict (listwise drop on all features dropped most rows for 6.0/7.0/8.0). Switched to mean-imputation with cohort-specific CodeSignal column lookup; this brought all four cohorts into the model.
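
The difference between the two join strategies can be sketched as follows. This is an illustration under stated assumptions, not the C3 pipeline: the column names in `CODESIGNAL_COLUMN`, the `pub_record` feature, and all values are hypothetical, and the sketch assumes means are computed within each cohort (the note above does not say whether imputation was per-cohort or pooled).

```python
from statistics import mean

# Hypothetical: the CodeSignal column is named differently in each cohort's export.
CODESIGNAL_COLUMN = {"7.0": "codesignal", "8.0": "CodeSignal Score"}
FEATURES = ["codesignal", "pub_record"]


def harmonize(cohort, rows):
    """Map cohort-specific column names onto one shared feature schema."""
    cs_col = CODESIGNAL_COLUMN[cohort]
    return [
        {"cohort": cohort,
         "codesignal": row.get(cs_col),
         "pub_record": row.get("pub_record")}
        for row in rows
    ]


def listwise_complete(rows, features):
    """Strict strategy: keep only rows with every feature present."""
    return [r for r in rows if all(r.get(f) is not None for f in features)]


def mean_impute(rows, features):
    """Lenient strategy: fill each missing feature with the mean of observed values."""
    filled = [dict(r) for r in rows]  # copy so the input rows are not mutated
    for f in features:
        observed = [r[f] for r in rows if r.get(f) is not None]
        if not observed:
            continue  # nothing to impute from; the value stays None
        fill = mean(observed)
        for r in filled:
            if r.get(f) is None:
                r[f] = fill
    return filled


# Toy rows, invented for illustration only.
raw = {
    "7.0": [{"codesignal": 700, "pub_record": 3},
            {"codesignal": None, "pub_record": 1}],
    "8.0": [{"CodeSignal Score": 650, "pub_record": None},
            {"CodeSignal Score": 750, "pub_record": 2}],
}

harmonized = {c: harmonize(c, rows) for c, rows in raw.items()}
strict = {c: listwise_complete(rows, FEATURES) for c, rows in harmonized.items()}
imputed = {c: mean_impute(rows, FEATURES) for c, rows in harmonized.items()}

for cohort in raw:
    print(cohort, "listwise rows:", len(strict[cohort]),
          "imputed rows:", len(imputed[cohort]))
```

Mean imputation keeps every row but shrinks the variance of the imputed feature, so coefficients on heavily imputed features should be read with that in mind.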