C6 — Did demographic outcomes change across cohorts?

Context

Cohort 10.0 moved to a more centralized, partially blinded review process. Did that change demographic outcomes? We track gender and race composition and outcomes across cohorts 7.0–10.0. (6.0 didn't collect demographics, so it's excluded.)

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
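
The composite formula above can be sketched in a few lines. This is an illustrative sketch, not the pipeline's actual code: the function names and the 0–10 score scale in the example are hypothetical, and it assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied.

```python
# Weights and multipliers are copied from the formula in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS for the empirical track."""
    # Assumption: the multiplier is applied to RS before weighting.
    adjusted_rs = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant scored on a 0-10 scale with "Adjacent" experience.
score = empirical_composite(rs=8, mle=6, swe=7, math_score=5, ss=9,
                            relevance="Adjacent")
print(round(score, 3))
```

Because the multiplier touches only Research Skills, a "Distant" applicant loses at most 40% of the RS term, or 20% of the maximum composite.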

Outcome definitions

Outcome definitions used throughout these analyses:

Caveats

Gender outcomes (Man / Woman / Non-binary)

Cohort  Gender      n    n passed  P(passed)  95% CI
7.0     Man         377  31        8.2%       [5.9%, 11.4%]
7.0     Non-binary  14   2         14.3%      [4.0%, 39.9%]
7.0     Woman       126  14        11.1%      [6.7%, 17.8%]
8.0     Man         597  42        7.0%       [5.2%, 9.4%]
8.0     Non-binary  20   2         10.0%      [2.8%, 30.1%]
8.0     Woman       228  12        5.3%       [3.0%, 9.0%]
9.0     Man         484  45        9.3%       [7.0%, 12.2%]
9.0     Non-binary  11   2         18.2%      [5.1%, 47.7%]
9.0     Woman       205  21        10.2%      [6.8%, 15.2%]
10.0    Man         918  76        8.3%       [6.7%, 10.2%]
10.0    Non-binary  25   1         4.0%       [0.7%, 19.5%]
10.0    Woman       430  30        7.0%       [4.9%, 9.8%]
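
The report does not state how the 95% CIs were computed. The tabulated intervals appear consistent with Wilson score intervals, so the sketch below assumes that method; the function name is hypothetical. It takes a pass count and a group size and returns the interval bounds.

```python
import math


def wilson_interval(passed, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 10.0 Non-binary row: 1 passed out of 25.
low, high = wilson_interval(1, 25)
print(f"[{low:.1%}, {high:.1%}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small groups, which matters for the Non-binary rows (n ≤ 25).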

Race outcomes (top 6 groups by sample size)

Pool composition by gender

Takeaways

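  1. Gender outcomes are broadly comparable across cohorts. Within each cohort, the 95% CIs for the Man, Woman, and Non-binary groups overlap, and the Man and Woman point estimates differ by about 1–3 percentage points (for example, 8.3% vs. 7.0% in 10.0). Non-binary estimates swing more widely (4.0% to 18.2%) but rest on n ≤ 25 per cohort. The shift to centralized review in 10.0 did not produce a visible shift in gender-conditional outcomes: the 10.0 pass rates for Man and Woman fall within the range seen in 7.0–9.0.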
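  2. Pool composition is slowly diversifying: the share of Women applicants has grown modestly cohort-over-cohort (roughly 24% in 7.0 to 31% in 10.0, judging by the n column of the gender outcomes table). The Non-binary share shows no clear trend in those counts (about 2–3% throughout). Whether the change is program-driven or applicant-pool-driven is not separable here.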
  3. Race outcomes are harder to read due to small subgroup sizes and high missingness in 9.0. Most race × cohort cells have wide CIs.
  4. For 11.0: continue tracking these outcomes cohort-over-cohort. The cleanest version of this analysis will be available once 11.0 and 12.0 provide multiple "centralized-process" data points to compare against the decentralized cohorts.
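
As a rough cross-check on pool composition, gender shares can be recomputed from the n column of the gender outcomes table. This is a sketch under one stated assumption: the denominators here are completed applicants with non-null gender in the three listed groups, which may differ from the denominators behind the pool-composition table. The function name is hypothetical.

```python
# n column from the gender outcomes table
# (completed applicants with non-null gender, per cohort).
COUNTS = {
    "7.0": {"Man": 377, "Non-binary": 14, "Woman": 126},
    "8.0": {"Man": 597, "Non-binary": 20, "Woman": 228},
    "9.0": {"Man": 484, "Non-binary": 11, "Woman": 205},
    "10.0": {"Man": 918, "Non-binary": 25, "Woman": 430},
}


def gender_shares(counts):
    """Return each group's share of the cohort total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


for cohort, counts in COUNTS.items():
    shares = gender_shares(counts)
    print(cohort, {group: f"{share:.1%}" for group, share in shares.items()})
```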

Debug: how the data was interpreted (safe to skip)

Sample. Per cohort: completed applicants with non-null gender / race. 8.0 / 9.0 use [pre] demographic columns; 7.0 uses base demographic columns; 10.0 uses [stage-1-demographics]. All are standardized via data.py into gender and race columns.

Outcome variable(s). passed_mentors_bar (a proxy in 7.0/8.0; the true outcome in 9.0/10.0).

Predictor fields. N/A — descriptive cross-tabs.

Filters applied. Completed applications only. Per-group n≥5 (gender) / n≥10 (race) threshold.

Missing-data handling. Per-cell listwise drop. Missing race in 9.0 is documented.
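
The filters and missing-data rule above can be sketched as a small cross-tab helper. This is illustrative only: passed_mentors_bar comes from this report, while the function name and the "completed" and "cohort" field names are hypothetical stand-ins for whatever the real dataset uses.

```python
from collections import defaultdict


def crosstab_pass_rates(rows, group_key, min_n):
    """Pass rate per (cohort, group) cell, dropping small cells.

    Rows that are incomplete or have a missing group value are dropped
    (per-cell listwise deletion).
    """
    cells = defaultdict(lambda: [0, 0])  # (cohort, group) -> [n, n_passed]
    for row in rows:
        if not row.get("completed"):
            continue  # completed applications only
        group = row.get(group_key)
        if group is None:
            continue  # missing demographic: drop from this cross-tab
        cell = cells[(row["cohort"], group)]
        cell[0] += 1
        cell[1] += int(bool(row["passed_mentors_bar"]))
    return {
        key: {"n": n, "n_passed": k, "rate": k / n}
        for key, (n, k) in cells.items()
        if n >= min_n  # n >= 5 for gender, n >= 10 for race
    }
```

Calling it with group_key="gender", min_n=5 and then group_key="race", min_n=10 would apply the two thresholds named above.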

Key assumptions / caveats.