A7 — How well did the LLM's stream recommendations match actual rankings?

Context

In 10.0, the Stage-1 LLM didn't just produce a pass/fail decision — for every (applicant, stream) pair, it produced a per-stream recommendation: one of Strong advance / Lean advance / Lean reject / Strong reject. These were advisory: streams saw them at Stage 3 but made their own ranking decisions. This analysis asks how often the LLM's recommendation actually matched the stream's eventual decision.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
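The formula above can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline's actual code: the function names are hypothetical, the score scale is left open, and it assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied.

```python
# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier is applied to RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# Hypothetical applicant scoring 8 on every component.
direct = empirical_composite(8, 8, 8, 8, 8, "Direct")
adjacent = empirical_composite(8, 8, 8, 8, 8, "Adjacent")
print(round(direct, 2), round(adjacent, 2))
```

Under these assumptions, moving the same applicant from Direct to Adjacent lowers only the Research Skills term, so the composite drops by 0.50 × 0.15 × RS.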

Outcome definitions

Outcome definitions used throughout these analyses:

How the LLM recommendations were produced

The Stage-1 LLM reviewer was prompted with each applicant's application materials plus a description of one stream's work. For prompt-token efficiency, the LLM also evaluated each applicant against adjacent streams in the same review prompt — i.e., the LLM produced recommendations for streams the applicant didn't actually apply to (these recs were 'cached' in case the applicant later applied). This analysis filters out the non-applicable recs: only LLM recs for streams the applicant truly applied to at Stage 3 are counted (71% of raw recs are dropped by this filter).
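A minimal sketch of the filter described above, assuming recs are held as (applicant, stream, label) tuples and each applicant's true Stage-3 application list is available as a set. The identifiers and data layout are hypothetical, not taken from the actual pipeline.

```python
def filter_to_true_applications(recs, applied):
    """Keep only recs for streams the applicant actually applied to.

    recs:    list of (applicant_id, stream, label) tuples
    applied: dict mapping applicant_id -> set of streams applied to at Stage 3
    """
    return [rec for rec in recs if rec[1] in applied.get(rec[0], set())]


# Hypothetical data: applicant a1 was also scored against s2 as a
# side effect of the grouped prompt, but only applied to s1.
recs = [
    ("a1", "s1", "Lean advance"),
    ("a1", "s2", "Strong advance"),  # cached rec, never applied
    ("a2", "s1", "Lean reject"),
]
applied = {"a1": {"s1"}, "a2": {"s1"}}

kept = filter_to_true_applications(recs, applied)
dropped_share = 1 - len(kept) / len(recs)
print(len(kept), "kept;", f"{dropped_share:.0%} dropped")
```

Applicants with no recorded application list contribute no recs under this sketch, which matches restricting the sample to applicants with a non-empty list.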

Definitions

Headline

Overall, when the LLM said advance (Strong or Lean), the stream actually ranked the applicant 3.0% of the time (251 of 8,340 advance recs); this is the advance precision. Advance recall, the fraction of actual rankings for which the LLM had said advance, is 97.7% (251 of 257).

Confusion matrix (LLM rec × actually ranked)

LLM rec | Not ranked | Ranked | Total
Lean advance | 6153 | 129 | 6282
Lean reject | 795 | 6 | 801
Strong advance | 1936 | 122 | 2058
Strong reject | 2 | 0 | 2
Total | 8886 | 257 | 9143

The share of advance recs that land in the Ranked column shows how often an LLM 'advance' matched the stream's decision. The disagreement cells are advance recs the stream did not rank (8,089) and reject recs the stream ranked anyway (6).
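The headline figures can be recomputed directly from the confusion matrix. The sketch below uses the counts from the table; the variable names are illustrative, and "reject rank rate" is taken to mean the share of reject recs that the stream ranked anyway.

```python
# (not ranked, ranked) counts per LLM recommendation, from the confusion matrix.
confusion = {
    "Strong advance": (1936, 122),
    "Lean advance": (6153, 129),
    "Lean reject": (795, 6),
    "Strong reject": (2, 0),
}
ADVANCE = {"Strong advance", "Lean advance"}

adv_ranked = sum(r for label, (_, r) in confusion.items() if label in ADVANCE)
adv_total = sum(n + r for label, (n, r) in confusion.items() if label in ADVANCE)
all_ranked = sum(r for _, r in confusion.values())
all_total = sum(n + r for n, r in confusion.values())

precision = adv_ranked / adv_total  # ranked share of advance recs
recall = adv_ranked / all_ranked    # advance share of ranked pairs
reject_rank_rate = (all_ranked - adv_ranked) / (all_total - adv_total)

print(f"precision={precision:.1%} recall={recall:.1%} "
      f"reject rank rate={reject_rank_rate:.1%}")
```

Because 8,340 of the 9,143 recs are advances and only 257 pairs were ranked, precision could not exceed about 3.1% (257 / 8,340) even if every ranked pair carried an advance rec.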

Per-stream metrics

Stream | # advance | # reject | # ranked | Advance precision | Advance recall | Reject rank rate
Lee Sharkey | 385 | 11 | 5 | 0.01 | 1.00 | 0.00
Mauricio Baker | 358 | 0 | 3 | 0.01 | 1.00 | n/a
Krishnamurthy Dvijotham (Dj) | 358 | 4 | 3 | 0.01 | 1.00 | 0.00
Arthur Conmy | 329 | 4 | 7 | 0.02 | 1.00 | 0.00
David Lindner | 299 | 7 | 2 | 0.01 | 1.00 | 0.00
Redwood Research | 295 | 19 | 4 | 0.01 | 1.00 | 0.00
Dan Mossing | 288 | 8 | 4 | 0.01 | 1.00 | 0.00
Tomek Korbak | 266 | 1 | 5 | 0.02 | 1.00 | 0.00
UKAISI Red-Team | 249 | 10 | 10 | 0.04 | 1.00 | 0.00
AI Futures Project | 245 | 6 | 0 | 0.00 | n/a | 0.00
Paul Riechers, Adam Shai | 240 | 1 | 6 | 0.03 | 1.00 | 0.00
Michael Chen | 237 | 15 | 0 | 0.00 | n/a | 0.00
Alan Cooney | 223 | 0 | 6 | 0.03 | 1.00 | n/a
Adrià Garriga-Alonso | 222 | 9 | 0 | 0.00 | n/a | 0.00
Dan Murfet, Jesse Hoogland | 216 | 29 | 7 | 0.03 | 1.00 | 0.00
Oliver Sourbut | 216 | 16 | 5 | 0.02 | 1.00 | 0.00
Team Shard | 216 | 1 | 12 | 0.06 | 1.00 | 0.00
Victoria Krakovna | 211 | 16 | 7 | 0.03 | 1.00 | 0.00
Mary Phuong | 205 | 17 | 6 | 0.03 | 1.00 | 0.00
Epoch AI | 192 | 6 | 1 | 0.01 | 1.00 | 0.00
Sarah Schwettmann, Jacob Steinhardt | 181 | 2 | 2 | 0.01 | 1.00 | 0.00
Maksym Andriushchenko | 177 | 0 | 10 | 0.06 | 1.00 | n/a
Alignment Research Center (ARC) | 177 | 10 | 6 | 0.03 | 1.00 | 0.00
Marius Hobbhahn | 175 | 2 | 10 | 0.06 | 1.00 | 0.00
Jacob Merizian | 174 | 0 | 9 | 0.05 | 1.00 | n/a
Daniel Kang | 174 | 3 | 0 | 0.00 | n/a | 0.00
He He | 161 | 6 | 5 | 0.03 | 1.00 | 0.00
Jeff Alstott | 152 | 11 | 25 | 0.14 | 0.88 | 0.27
Roger Grosse | 152 | 72 | 8 | 0.05 | 1.00 | 0.00
Stephen Casper (Cas) | 146 | 1 | 6 | 0.04 | 1.00 | 0.00
LawZero | 146 | 5 | 3 | 0.02 | 1.00 | 0.00
Megan Kinniment | 138 | 0 | 2 | 0.01 | 1.00 | n/a
Cristian Trout | 134 | 32 | 4 | 0.03 | 1.00 | 0.00
Shi Feng | 122 | 0 | 5 | 0.04 | 1.00 | n/a
Patrick Butlin | 114 | 19 | 2 | 0.02 | 1.00 | 0.00
Safe AI Forum | 106 | 3 | 5 | 0.05 | 1.00 | 0.00
Alexis Carlier, Zainab Ali Majid | 91 | 11 | 3 | 0.03 | 1.00 | 0.00
Matthew Gentzel | 86 | 21 | 8 | 0.09 | 1.00 | 0.00
Neev Parikh | 81 | 1 | 9 | 0.11 | 1.00 | 0.00
Richard Ngo | 80 | 0 | 13 | 0.16 | 1.00 | n/a
Janet Egan | 69 | 60 | 4 | 0.04 | 0.75 | 0.02
Keri Warr | 62 | 20 | 2 | 0.03 | 1.00 | 0.00
Peter Henderson | 53 | 242 | 5 | 0.06 | 0.60 | 0.01
Abram Demski | 49 | 11 | 13 | 0.27 | 1.00 | 0.00
Forethought | 48 | 82 | 0 | 0.00 | n/a | 0.00
Gabriel Kulp | 42 | 9 | 5 | 0.12 | 1.00 | 0.00

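Streams are listed in descending order of advance recs. Reject rank rate is the share of reject recs that the stream ranked anyway. A metric has no value where its denominator is zero (recall when the stream ranked no one, reject rank rate when the LLM issued no rejects). Low precision means either (a) the LLM is over-issuing advances to candidates this stream doesn't pick, or (b) the stream's selection criterion differs sharply from what the LLM is encoding.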
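The per-stream columns can be derived from pair-level data as sketched below. This assumes one row per (applicant, stream) pair with the LLM label and a ranked flag; the function name and the sample rows are hypothetical. A metric is returned as None when its denominator is zero, for example recall for a stream that ranked nobody.

```python
from collections import defaultdict

ADVANCE_LABELS = {"Strong advance", "Lean advance"}


def per_stream_metrics(pairs):
    """pairs: iterable of (stream, label, ranked) with ranked a bool."""
    counts = defaultdict(
        lambda: {"adv": 0, "rej": 0, "adv_ranked": 0, "rej_ranked": 0}
    )
    for stream, label, ranked in pairs:
        side = "adv" if label in ADVANCE_LABELS else "rej"
        counts[stream][side] += 1
        if ranked:
            counts[stream][side + "_ranked"] += 1

    def ratio(num, den):
        # Undefined (None) rather than 0.0 when there is nothing to divide by.
        return num / den if den else None

    metrics = {}
    for stream, c in counts.items():
        ranked_total = c["adv_ranked"] + c["rej_ranked"]
        metrics[stream] = {
            "precision": ratio(c["adv_ranked"], c["adv"]),
            "recall": ratio(c["adv_ranked"], ranked_total),
            "reject_rank_rate": ratio(c["rej_ranked"], c["rej"]),
        }
    return metrics


# Hypothetical rows, not real applicants or streams.
pairs = [
    ("stream_x", "Strong advance", True),
    ("stream_x", "Lean advance", False),
    ("stream_x", "Lean reject", True),
    ("stream_y", "Lean advance", False),
]
metrics = per_stream_metrics(pairs)
print(metrics["stream_x"])
print(metrics["stream_y"])
```

Keeping undefined values distinct from 0.00 matters when comparing streams: a stream with no rejects has no reject rank rate at all, which is different from a stream whose rejects were never ranked.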

Precision vs. volume

Bottom-left: streams where the LLM rarely says advance and is rarely right even when it does. Top-right: streams with many advances and high precision (the LLM tracks them well).

Takeaways

  1. LLM 'reject' calls are highly accurate: applicants the LLM rejected almost never end up ranked (6 of 803 reject recs, about 0.7%, all in three streams). The reject side of the rubric is doing real work.
  2. LLM 'advance' calls are permissive — overall precision around 3%. This is partly structural (streams rank only their top ~5–15 applicants, so a much wider 'advance' pool than the eventual top-N is inevitable), and partly the LLM being liberal at the screening boundary.
  3. The right framing is "the LLM is a high-recall, low-precision pre-filter" — it reliably catches the people streams will eventually rank, with a lot of false positives in the middle. That's the correct shape for a screening tool, but it means LLM advances aren't a substitute for the stream's own review.
  4. For 11.0 prompt redesign, focus on streams with the lowest precision (over-advancing) or lowest recall (missing actual top picks).

🔧 Debug: how the data was interpreted (safe to skip)

Sample. All Stage-3 applicants with a non-empty "Stage 3 streams actually applied to" list (n=1,059 after dedup). The Compilation column was parsed, and recs were filtered to true Stage-3 applications only (per CLAUDE.md caveat #4). Nanda recs were excluded. Total LLM recs after filter: 9,143 across 46 streams.

Outcome variable(s). Per (applicant, stream) pair: did the stream actually rank this applicant? Derived from streams_ranked_by (display names mapped back to internal handles via the streams table).

Predictor fields. LLM recommendation label ∈ {Strong advance, Lean advance, Lean reject, Strong reject}. Parsed from [stage-1-track-review] [AI] Compilation of all stream reviews with regex ^([^()]+)\(([^)]*)\):(.+)$ on |-delimited segments.
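The parsing step might look roughly like the sketch below. The regex is the one quoted above; everything else is an assumption, including the helper name, the sample cell, and the reading of the three capture groups (text before the parentheses, text inside them, text after the colon), since the source does not say what each group holds.

```python
import re

# Regex quoted in the text, applied to each |-delimited segment.
SEGMENT_RE = re.compile(r"^([^()]+)\(([^)]*)\):(.+)$")


def parse_compilation(cell):
    """Split a compilation cell on '|' and parse each segment.

    Returns a list of (before_paren, inside_paren, after_colon) tuples.
    Segments that do not match the regex are dropped silently.
    """
    parsed = []
    for segment in cell.split("|"):
        match = SEGMENT_RE.match(segment.strip())
        if match is None:
            continue  # unparseable segment
        parsed.append(tuple(group.strip() for group in match.groups()))
    return parsed


# Hypothetical cell; the real column layout may differ.
cell = "stream_a(Empirical): Lean advance | stream_b(Theory): Strong reject | garbled"
parsed = parse_compilation(cell)
print(parsed)
```

Because unmatched segments are skipped without a warning, counting the skipped segments during a run would give a quick check on how much data the silent drop removes.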

Filters applied. Recs filtered against applicant's true Stage 3 application list. Recs for streams the applicant didn't truly apply to (grouped-prompt side effect) are dropped.

Missing-data handling. Unparseable segments dropped silently. Recs with handles not present in the streams table are kept (just not mapped to display names).

Key assumptions / caveats.