In 10.0, the Stage-1 LLM didn't just produce a pass/fail decision — for every (applicant, stream) pair, it produced a per-stream recommendation: one of Strong advance / Lean advance / Lean reject / Strong reject. These were advisory: streams saw them at Stage 3 but made their own ranking decisions. This analysis asks how often the LLM's recommendation actually matched the stream's eventual decision.
MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).
The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:
For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
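As a rough illustration, the empirical-track composite can be sketched as below. The function names are hypothetical, and the sketch assumes the relevance multiplier scales RS before the 0.50 weight is applied; the text says only that it is "applied to Research Skills". The input scale is also an assumption: the example uses made-up scores on an arbitrary 0-10 scale.

```python
# Relevance multipliers as given in the text.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def te_score(mle: float, swe: float, math: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs: float, mle: float, swe: float, math: float,
                        ss: float, relevance: str = "Direct") -> float:
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the multiplier is applied to RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    return 0.50 * adjusted_rs + 0.35 * te_score(mle, swe, math) + 0.15 * ss


# Made-up applicant with "Adjacent" experience.
score = empirical_composite(rs=8, mle=7, swe=6, math=5, ss=9, relevance="Adjacent")
print(round(score, 3))  # 6.955
```

Under this reading, moving the same applicant from Direct to Distant lowers only the RS term, so the multiplier can shift the composite by at most 0.50 × 0.40 × RS.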
Outcome definitions used throughout these analyses:
- is_ranked (primary outcome): applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer": offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
- is_invited_to_worktest (secondary outcome): applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
- passed_mentors_bar: applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

The Stage-1 LLM reviewer was prompted with each applicant's application materials plus a description of one stream's work. For prompt-token efficiency, the LLM also evaluated each applicant against adjacent streams in the same review prompt; i.e., the LLM produced recommendations for streams the applicant didn't actually apply to (these recs were 'cached' in case the applicant later applied). This analysis filters out the non-applicable recs: only LLM recs for streams the applicant truly applied to at Stage 3 are counted (71% of raw recs are dropped by this filter).
Overall, when the LLM said advance (Strong or Lean), the applicant was actually ranked by that stream 3.0% of the time. Recall — what fraction of actual rankings the LLM predicted with an advance — is 97.7%.
| rec | Not ranked | Ranked | Total |
|---|---|---|---|
| Lean advance | 6153 | 129 | 6282 |
| Lean reject | 795 | 6 | 801 |
| Strong advance | 1936 | 122 | 2058 |
| Strong reject | 2 | 0 | 2 |
| Total | 8886 | 257 | 9143 |
The share of advance recs that end up ranked shows how often an LLM 'advance' tracks the stream's actual decision. The disagreement cells are advance recs that the stream did not rank and reject recs that the stream did rank.
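The headline precision and recall figures can be recomputed directly from the rec-by-outcome table. The sketch below hard-codes the four rows of that table; the dictionary layout and function name are illustrative, not part of the original analysis code.

```python
# Counts copied from the rec-by-outcome table.
confusion = {
    "Strong advance": {"not_ranked": 1936, "ranked": 122},
    "Lean advance": {"not_ranked": 6153, "ranked": 129},
    "Lean reject": {"not_ranked": 795, "ranked": 6},
    "Strong reject": {"not_ranked": 2, "ranked": 0},
}
ADVANCE_LABELS = {"Strong advance", "Lean advance"}


def advance_precision_recall(table):
    """Treat Strong/Lean advance as the positive prediction."""
    advance_ranked = sum(
        cell["ranked"] for rec, cell in table.items() if rec in ADVANCE_LABELS
    )
    advance_total = sum(
        cell["ranked"] + cell["not_ranked"]
        for rec, cell in table.items()
        if rec in ADVANCE_LABELS
    )
    all_ranked = sum(cell["ranked"] for cell in table.values())
    precision = advance_ranked / advance_total  # ranked among advance recs
    recall = advance_ranked / all_ranked        # advance recs among ranked pairs
    return precision, recall


precision, recall = advance_precision_recall(confusion)
print(f"precision={precision:.1%} recall={recall:.1%}")
# precision=3.0% recall=97.7%
```

That is 251 ranked out of 8,340 advance recs for precision, and 251 out of 257 ranked pairs for recall.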
| Stream | # advance | # reject | # ranked | Advance precision | Advance recall | Reject rank rate |
|---|---|---|---|---|---|---|
| Lee Sharkey | 385 | 11 | 5 | 0.01 | 1.00 | 0.00 |
| Mauricio Baker | 358 | 0 | 3 | 0.01 | 1.00 | — |
| Krishnamurthy Dvijotham (Dj) | 358 | 4 | 3 | 0.01 | 1.00 | 0.00 |
| Arthur Conmy | 329 | 4 | 7 | 0.02 | 1.00 | 0.00 |
| David Lindner | 299 | 7 | 2 | 0.01 | 1.00 | 0.00 |
| Redwood Research | 295 | 19 | 4 | 0.01 | 1.00 | 0.00 |
| Dan Mossing | 288 | 8 | 4 | 0.01 | 1.00 | 0.00 |
| Tomek Korbak | 266 | 1 | 5 | 0.02 | 1.00 | 0.00 |
| UKAISI Red-Team | 249 | 10 | 10 | 0.04 | 1.00 | 0.00 |
| AI Futures Project | 245 | 6 | 0 | 0.00 | — | 0.00 |
| Paul Riechers, Adam Shai | 240 | 1 | 6 | 0.03 | 1.00 | 0.00 |
| Michael Chen | 237 | 15 | 0 | 0.00 | — | 0.00 |
| Alan Cooney | 223 | 0 | 6 | 0.03 | 1.00 | — |
| Adrià Garriga-Alonso | 222 | 9 | 0 | 0.00 | — | 0.00 |
| Dan Murfet, Jesse Hoogland | 216 | 29 | 7 | 0.03 | 1.00 | 0.00 |
| Oliver Sourbut | 216 | 16 | 5 | 0.02 | 1.00 | 0.00 |
| Team Shard | 216 | 1 | 12 | 0.06 | 1.00 | 0.00 |
| Victoria Krakovna | 211 | 16 | 7 | 0.03 | 1.00 | 0.00 |
| Mary Phuong | 205 | 17 | 6 | 0.03 | 1.00 | 0.00 |
| Epoch AI | 192 | 6 | 1 | 0.01 | 1.00 | 0.00 |
| Sarah Schwettmann, Jacob Steinhardt | 181 | 2 | 2 | 0.01 | 1.00 | 0.00 |
| Maksym Andriushchenko | 177 | 0 | 10 | 0.06 | 1.00 | — |
| Alignment Research Center (ARC) | 177 | 10 | 6 | 0.03 | 1.00 | 0.00 |
| Marius Hobbhahn | 175 | 2 | 10 | 0.06 | 1.00 | 0.00 |
| Jacob Merizian | 174 | 0 | 9 | 0.05 | 1.00 | — |
| Daniel Kang | 174 | 3 | 0 | 0.00 | — | 0.00 |
| He He | 161 | 6 | 5 | 0.03 | 1.00 | 0.00 |
| Jeff Alstott | 152 | 11 | 25 | 0.14 | 0.88 | 0.27 |
| Roger Grosse | 152 | 72 | 8 | 0.05 | 1.00 | 0.00 |
| Stephen Casper (Cas) | 146 | 1 | 6 | 0.04 | 1.00 | 0.00 |
| LawZero | 146 | 5 | 3 | 0.02 | 1.00 | 0.00 |
| Megan Kinniment | 138 | 0 | 2 | 0.01 | 1.00 | — |
| Cristian Trout | 134 | 32 | 4 | 0.03 | 1.00 | 0.00 |
| Shi Feng | 122 | 0 | 5 | 0.04 | 1.00 | — |
| Patrick Butlin | 114 | 19 | 2 | 0.02 | 1.00 | 0.00 |
| Safe AI Forum | 106 | 3 | 5 | 0.05 | 1.00 | 0.00 |
| Alexis Carlier, Zainab Ali Majid | 91 | 11 | 3 | 0.03 | 1.00 | 0.00 |
| Matthew Gentzel | 86 | 21 | 8 | 0.09 | 1.00 | 0.00 |
| Neev Parikh | 81 | 1 | 9 | 0.11 | 1.00 | 0.00 |
| Richard Ngo | 80 | 0 | 13 | 0.16 | 1.00 | — |
| Janet Egan | 69 | 60 | 4 | 0.04 | 0.75 | 0.02 |
| Keri Warr | 62 | 20 | 2 | 0.03 | 1.00 | 0.00 |
| Peter Henderson | 53 | 242 | 5 | 0.06 | 0.60 | 0.01 |
| Abram Demski | 49 | 11 | 13 | 0.27 | 1.00 | 0.00 |
| Forethought | 48 | 82 | 0 | 0.00 | — | 0.00 |
| Gabriel Kulp | 42 | 9 | 5 | 0.12 | 1.00 | 0.00 |
Streams sorted by precision. Low precision for a stream means either (a) the LLM is over-issuing advances to candidates this stream doesn't pick, or (b) the stream's selection criterion differs sharply from what the LLM is encoding.
Bottom-left = streams where the LLM rarely recommends advance and the advances it does issue rarely end up ranked. Top-right = streams with many advances and high precision (the LLM tracks them well).
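The three rate columns in the per-stream table can be reproduced with a small helper like the one below. It is a sketch: the names are hypothetical, and the split of ranked applicants into "ranked after an advance" versus "ranked after a reject" is inferred here from the rounded table values rather than taken from the underlying data. A rate whose denominator is zero is returned as None and printed as a dash, matching the table.

```python
from typing import NamedTuple, Optional


class StreamMetrics(NamedTuple):
    precision: Optional[float]         # ranked-from-advance / # advance
    recall: Optional[float]            # ranked-from-advance / # ranked
    reject_rank_rate: Optional[float]  # ranked-from-reject / # reject


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero."""
    return numerator / denominator if denominator else None


def stream_metrics(n_advance: int, n_reject: int,
                   ranked_from_advance: int, ranked_from_reject: int) -> StreamMetrics:
    n_ranked = ranked_from_advance + ranked_from_reject
    return StreamMetrics(
        precision=_ratio(ranked_from_advance, n_advance),
        recall=_ratio(ranked_from_advance, n_ranked),
        reject_rank_rate=_ratio(ranked_from_reject, n_reject),
    )


def fmt(value: Optional[float]) -> str:
    """Two decimals, or the dash the table uses for undefined rates."""
    return "\u2014" if value is None else f"{value:.2f}"


# Jeff Alstott row: 152 advance, 11 reject, 25 ranked (inferred split: 22 + 3).
alstott = stream_metrics(152, 11, ranked_from_advance=22, ranked_from_reject=3)
print([fmt(v) for v in alstott])  # ['0.14', '0.88', '0.27']

# AI Futures Project row: nobody ranked, so recall is undefined.
aifp = stream_metrics(245, 6, ranked_from_advance=0, ranked_from_reject=0)
print([fmt(v) for v in aifp])  # ['0.00', '—', '0.00']
```

With so few ranked applicants per stream (often under 10), these per-stream rates move a lot when a single applicant changes category.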
Sample. All Stage-3 applicants with a non-empty "Stage 3 streams actually applied to" list (n=1,059 after dedup). The Compilation column was parsed, and recs were filtered to TRUE Stage-3 applications only (per CLAUDE.md caveat #4). Nanda recs were excluded. Total LLM recs after the filter: 9,143 across 46 streams.
Outcome variable(s). Per (applicant, stream) pair: did the stream actually rank this applicant? Derived from streams_ranked_by (display names mapped back to Internal handles via the streams table).
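A minimal sketch of how this outcome could be derived per (applicant, stream) pair follows. It assumes streams_ranked_by has already been split into a list of display names and that the streams table is available as a display-name-to-handle mapping; both shapes, and the handles used in the example, are hypothetical.

```python
def ranked_handles(streams_ranked_by, display_to_handle):
    """Map the display names in streams_ranked_by back to internal handles.

    Display names missing from the mapping are skipped here; that choice
    is an assumption of this sketch.
    """
    return {
        display_to_handle[name]
        for name in streams_ranked_by
        if name in display_to_handle
    }


def label_pairs(applied_handles, streams_ranked_by, display_to_handle):
    """Return {handle: True/False} for every stream the applicant applied to."""
    ranked = ranked_handles(streams_ranked_by, display_to_handle)
    return {handle: handle in ranked for handle in applied_handles}


# Hypothetical handles; the display names come from the per-stream table.
display_to_handle = {"Team Shard": "shard", "Richard Ngo": "ngo", "He He": "hehe"}
outcome = label_pairs(
    applied_handles=["shard", "ngo", "hehe"],
    streams_ranked_by=["Richard Ngo"],
    display_to_handle=display_to_handle,
)
print(outcome)  # {'shard': False, 'ngo': True, 'hehe': False}
```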
Predictor fields. LLM recommendation label ∈ {Strong advance, Lean advance, Lean reject, Strong reject}. Parsed from [stage-1-track-review] [AI] Compilation of all stream reviews with regex ^([^()]+)\(([^)]*)\):(.+)$ on |-delimited segments.
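The parsing step can be sketched as follows, using the regex quoted above. The sketch assumes that the first capture group is the stream handle, the second is the recommendation label, and the third is free-text rationale; the text does not spell out the group meanings, and the sample cell and handles below are invented.

```python
import re

# Regex as quoted in the text, applied to each |-delimited segment.
SEGMENT_RE = re.compile(r"^([^()]+)\(([^)]*)\):(.+)$")


def parse_compilation(cell: str):
    """Split a compilation cell on '|' and parse each segment.

    Returns (handle, label, rationale) tuples. Segments that do not
    match the regex are dropped silently.
    """
    recs = []
    for segment in cell.split("|"):
        match = SEGMENT_RE.match(segment.strip())
        if match is None:
            continue  # unparseable segment
        handle, label, rationale = (group.strip() for group in match.groups())
        recs.append((handle, label, rationale))
    return recs


cell = (
    "shard(Lean advance): relevant RL background"
    " | ngo(Strong advance): strong conceptual writing"
    " | garbled text with no label"
)
for rec in parse_compilation(cell):
    print(rec)
# ('shard', 'Lean advance', 'relevant RL background')
# ('ngo', 'Strong advance', 'strong conceptual writing')
```

Two properties of this regex are worth keeping in mind: a rationale containing a literal | would be split into separate segments before matching, and without re.DOTALL a segment with an internal newline does not match at all. Either case would end up in the silently dropped set.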
Filters applied. Recs filtered against applicant's true Stage 3 application list. Recs for streams the applicant didn't truly apply to (grouped-prompt side effect) are dropped.
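The filter itself amounts to a set-membership check per rec. A minimal sketch, assuming recs are (handle, label) tuples and the applicant's true Stage 3 application list is a list of handles (hypothetical shapes and handles):

```python
def filter_to_applied(recs, applied_handles):
    """Keep only recs whose stream the applicant truly applied to."""
    applied = set(applied_handles)
    kept = [rec for rec in recs if rec[0] in applied]
    dropped = len(recs) - len(kept)
    return kept, dropped


# The LLM reviewed four streams in one grouped prompt,
# but the applicant applied to only one of them.
recs = [
    ("shard", "Lean advance"),
    ("ngo", "Strong advance"),
    ("hehe", "Lean reject"),
    ("kulp", "Lean advance"),
]
kept, dropped = filter_to_applied(recs, applied_handles=["ngo"])
print(kept, dropped)  # [('ngo', 'Strong advance')] 3
```

Returning the dropped count alongside the kept recs makes it easy to track the overall drop rate across applicants.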
Missing-data handling. Unparseable segments dropped silently. Recs with handles not present in the streams table are kept (just not mapped to display names).
Key assumptions / caveats.
n_total_recs looks unexpectedly low.