D2 — How consistent are streams with the central composite?

Context

Within each stream that ranked applicants, how closely does the stream's ranking track the composite score? High Spearman correlation = the stream is following the composite rubric closely. Near-zero or negative correlation = the stream is using criteria not captured by the composite.

Practically: low-consistency streams are either (a) finding something in applicants the composite misses, or (b) ranking on noise. Both are interesting cases.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
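The formula above can be sketched in a few lines of Python. The weights and multipliers are the ones stated in the text; the function names and the 1 to 5 score scale in the example are assumptions made for illustration, and how a single relevance level is chosen for an applicant who applied to several streams is not specified here.

```python
# Relevance multipliers as stated in the text; applied to Research Skills only.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS, with RS scaled by relevance."""
    adjusted_rs = RELEVANCE[relevance] * rs
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant on an assumed 1-5 scale (not real data).
direct = empirical_composite(4, 3, 4, 2, 5, relevance="Direct")
distant = empirical_composite(4, 3, 4, 2, 5, relevance="Distant")
print(round(direct, 3), round(distant, 3))
```

Because the multiplier touches only the Research Skills term, the same applicant loses 0.50 · (1 − 0.60) · RS of composite when moving from Direct to Distant, regardless of their other scores.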

Outcome definitions

Outcome definitions used throughout these analyses:

Stream ranking convention

Each stream submits an ordered list of applicants. Rank 1 = the stream's top pick. We negate the rank so that higher ρ = composite tracks the stream's order well.
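A minimal sketch of this convention, assuming Spearman's ρ is computed as the Pearson correlation of average ranks (the standard definition when ties are present). The helper names and the example scores are hypothetical; the actual analysis code is not shown in this report.

```python
def average_ranks(values):
    """Ascending ranks (1 = smallest); tied values share their mean rank."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    # Raises ZeroDivisionError if either input is constant (rho is undefined).
    return cov / (var_x * var_y) ** 0.5


def stream_consistency(composites, stream_ranks):
    """Spearman rho between composite and NEGATED stream rank (rank 1 = top pick)."""
    negated = [-r for r in stream_ranks]
    return pearson(average_ranks(composites), average_ranks(negated))


# Hypothetical stream with five ranked applicants (not real data).
composites = [3.9, 3.5, 3.7, 3.1, 3.3]
stream_ranks = [1, 2, 3, 4, 5]
rho = stream_consistency(composites, stream_ranks)
print(round(rho, 2))
```

Without the negation, a stream that ordered applicants exactly by composite would score ρ = −1, since rank 1 would pair with the highest composite.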

Per-stream consistency

Stream                            n ranked  ρ(composite ↔ rank)   Top attribute ρ
Gabriel Kulp                      5         -0.60
Arthur Conmy                      8         -0.60                 Math (-0.30)
Alignment Research Center (ARC)   6         -0.52
Peter Henderson                   5         -0.40                 MLE (-0.87)
Maksym Andriushchenko             10        -0.37                 MLE (-0.36)
Neev Parikh                       9         -0.22                 MLE (+0.35)
Team Shard                        12        -0.03                 RS (+0.35)
Tomek Korbak                      5         +0.00                 SWE (-0.97)
Dan Murfet, Jesse Hoogland        5         +0.00
Jacob Merizian                    9         +0.12                 SWE (+0.58)
Abram Demski                      13        +0.15
Marius Hobbhahn                   10        +0.16                 SWE (+0.51)
Shi Feng                          5         +0.20                 SS (+0.82)
Jeff Alstott                      24        +0.23                 Math (-0.51)
UKAISI Red-Team                   10        +0.24                 Math (+0.47)
Richard Ngo                       13        +0.25                 Math (+0.67)
Alan Cooney                       6         +0.26                 MLE (-0.51)
Stephen Casper (Cas)              6         +0.31                 MLE (+0.87)
Mary Phuong                       6         +0.43                 Math (+0.68)
Matthew Gentzel                   8         +0.59
Victoria Krakovna                 7         +0.61                 Math (+0.56)
Safe AI Forum                     5         +0.67
He He                             5         +0.70                 RS (+0.71)
Anthropic and OpenAI Megastream   6         +0.71                 SWE (-0.52)
Roger Grosse                      8         +0.88                 RS (+0.82)

High-consistency streams (ρ > 0.3): Stephen Casper (Cas), Mary Phuong, Matthew Gentzel, Victoria Krakovna, Safe AI Forum, He He, Anthropic and OpenAI Megastream, Roger Grosse. Low / negative consistency (ρ < 0): Gabriel Kulp, Arthur Conmy, Alignment Research Center (ARC), Peter Henderson, Maksym Andriushchenko, Neev Parikh, Team Shard.

Per-attribute consistency

If the composite doesn't track a stream's rank well, one of the individual attribute scores might. The heatmap below shows which attribute correlates most strongly with each stream's rank.

For each row, look for the cell with the strongest positive correlation — that's the attribute the stream seems to weight most heavily in its ranking.
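The reading rule can be made concrete with a small sketch. Several "Top attribute ρ" entries in the per-stream table are negative, which suggests that column reports the largest-magnitude correlation rather than the strongest positive one; that is an inference from the table, not something the report states. The helper below returns both so the two readings can be compared. The function name and the example row are hypothetical.

```python
def top_attributes(row):
    """Return (strongest positive attribute, largest-magnitude attribute) for one stream.

    `row` maps attribute name -> Spearman rho with the stream's negated rank.
    """
    positive = {name: rho for name, rho in row.items() if rho > 0}
    best_positive = max(positive, key=positive.get) if positive else None
    best_magnitude = max(row, key=lambda name: abs(row[name]))
    return best_positive, best_magnitude


# Hypothetical heatmap row (illustrative values, not taken from the data).
hypothetical_row = {"RS": 0.10, "MLE": -0.87, "SWE": 0.05, "Math": 0.20, "SS": -0.15}
best_positive, best_magnitude = top_attributes(hypothetical_row)
print(best_positive, best_magnitude)
```

When the two answers differ, as in this row, the stream appears to rank against an attribute more strongly than it ranks for any other, which is worth distinguishing from a stream that simply favors one attribute.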

Takeaways

  1. Streams vary substantially in how closely they track the composite. Six of the 25 track it fairly closely (ρ > 0.5), two sit at exactly zero, and seven have negative correlation with the composite, meaning they are ranking on something else.
  2. For low-consistency streams, the per-attribute heatmap reveals what they are tracking. Combined with A6's per-cluster regressions, we can characterize "this stream cares about Math more than the composite weights" or similar.
  3. For 11.0: streams with negative composite-rank correlation are worth talking to. Are they finding something the composite misses, or ranking idiosyncratically? An advisory per-stream signal (per A6) would help close this gap.
  4. Sample sizes are small: streams have 5 to 24 ranked applicants, and 21 of the 25 have 10 or fewer. Per-stream point estimates of ρ should be treated as suggestive.
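To give a sense of how little a single ρ pins down at these sample sizes, the sketch below enumerates every possible ordering of n applicants and counts how often a ρ at least as large as the observed one would arise if the stream ranked at random. It assumes no ties in either ranking and is an illustration only; the function name is hypothetical and this test is not part of the reported analysis.

```python
from itertools import permutations


def exact_spearman_p(n, rho_observed):
    """One-sided P(rho >= rho_observed) under random ranking, assuming no ties."""
    base = range(1, n + 1)
    denom = n * (n * n - 1)
    hits = total = 0
    for perm in permutations(base):
        d2 = sum((a - b) ** 2 for a, b in zip(base, perm))
        rho = 1 - 6 * d2 / denom  # closed form, valid only without ties
        total += 1
        if rho >= rho_observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


p_070 = exact_spearman_p(5, 0.70)
p_090 = exact_spearman_p(5, 0.90)
print(round(p_070, 3), round(p_090, 3))
```

With n = 5, a ρ of +0.70 arises by chance in 14 of the 120 possible orderings (about 12%), and only ρ ≥ 0.90 falls below the conventional 5% threshold. Enumeration is feasible up to roughly n = 10; larger streams would need sampling instead.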

🔧 Debug: how the data was interpreted (safe to skip)

Sample. All (applicant, stream) pairs in 10.0 where the stream ranked the applicant. Nanda excluded. Streams with n_ranked < 5 dropped. Total streams analyzed: 25.

Outcome variable(s). Stream-side rank position (negated so higher ρ = consistent with composite).

Predictor fields. Empirical composite + 5 attribute tier scores.

Filters applied. Nanda excluded per memory. Streams with n_ranked < 5 excluded due to rank-correlation instability.

Missing-data handling. Per-cell listwise drop.

Key assumptions / caveats.