A7 — LLM stream rec accuracy

Context

In 10.0, the Stage-1 LLM didn't just produce a pass/fail decision — for every (applicant, stream) pair, it produced a per-stream recommendation: one of Strong advance / Lean advance / Lean reject / Strong reject. These were advisory: streams saw them at Stage 3 but made their own ranking decisions. This analysis asks how often the LLM's recommendation actually matched the stream's eventual decision.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1: applicants submitted their background, experience, and motivation, and picked the research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar and produced advisory per-stream recommendations.
Stage 2: applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, and Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3: applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked applicants were waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
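The composite formula above can be sketched in a few lines. This is a minimal illustration, not the 10.0 scoring code: the function and argument names are hypothetical, and it assumes the relevance multiplier scales Research Skills before the 0.50 weight is applied (the text says only that it is "applied to Research Skills").

```python
# Relevance multipliers as stated in the text.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math


def empirical_composite(rs, mle, swe, math, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier scales RS before weighting.
    """
    adjusted_rs = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss
```

Under this assumption, moving an applicant from Direct to Distant relevance removes 40% of the Research Skills term, which is at most 20% of the maximum composite.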

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).

How the LLM recommendations were produced

The Stage-1 LLM reviewer was prompted with each applicant's application materials plus a description of one stream's work. For prompt-token efficiency, the LLM also evaluated each applicant against adjacent streams in the same review prompt — i.e., the LLM produced recommendations for streams the applicant didn't actually apply to (these recs were 'cached' in case the applicant later applied). This analysis filters out the non-applicable recs: only LLM recs for streams the applicant truly applied to at Stage 3 are counted (71% of raw recs are dropped by this filter).
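The filter described above amounts to a set-membership check on (applicant, stream) pairs. The sketch below is illustrative only: the record layout, field names, and toy data are assumptions, not the actual 10.0 data schema.

```python
def filter_applicable(raw_recs, applications):
    """Keep only recs whose (applicant_id, stream) pair was applied to.

    raw_recs: iterable of dicts with 'applicant_id', 'stream', 'rec' keys
              (hypothetical layout).
    applications: iterable of (applicant_id, stream) tuples from Stage 3.
    """
    applied = set(applications)
    return [r for r in raw_recs if (r["applicant_id"], r["stream"]) in applied]


# Toy data: applicant 1 applied only to stream A, applicant 2 only to stream B.
raw = [
    {"applicant_id": 1, "stream": "A", "rec": "Lean advance"},
    {"applicant_id": 1, "stream": "B", "rec": "Lean reject"},  # cached, never applied
    {"applicant_id": 2, "stream": "B", "rec": "Strong advance"},
]
kept = filter_applicable(raw, [(1, "A"), (2, "B")])
dropped_share = 1 - len(kept) / len(raw)  # share of raw recs removed by the filter
```

In the real data the analogous dropped share is the 71% quoted above.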

Definitions

Precision (for 'advance' recs): of applicants the LLM said to advance for stream X, what fraction did stream X actually rank? Note: streams rank only their top ~5-15 applicants, so absolute precision on a ranking outcome is necessarily low.
Recall: of applicants stream X actually ranked, what fraction did the LLM correctly flag as 'advance'?
Reject rank rate: of applicants the LLM said to reject for stream X, what fraction did stream X still rank? Should be near 0 if the LLM's reject calls are reliable.
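As a worked example of the three definitions, the sketch below computes them from rec-level counts. The function and variable names are illustrative; the input numbers are the overall counts from the confusion matrix table on this page, with Strong and Lean pooled on each side.

```python
ADVANCE = ("Strong advance", "Lean advance")
REJECT = ("Strong reject", "Lean reject")


def ratio(num, den):
    """Return num/den, or None when the denominator is zero."""
    return num / den if den else None


def rec_metrics(counts):
    """counts maps each rec label to a (not_ranked, ranked) tuple."""
    adv_ranked = sum(counts[r][1] for r in ADVANCE)
    adv_total = sum(sum(counts[r]) for r in ADVANCE)
    rej_ranked = sum(counts[r][1] for r in REJECT)
    rej_total = sum(sum(counts[r]) for r in REJECT)
    return {
        "precision": ratio(adv_ranked, adv_total),
        "recall": ratio(adv_ranked, adv_ranked + rej_ranked),
        "reject_rank_rate": ratio(rej_ranked, rej_total),
    }


overall = {
    "Strong advance": (1936, 122),
    "Lean advance": (6153, 129),
    "Lean reject": (795, 6),
    "Strong reject": (2, 0),
}
m = rec_metrics(overall)
print({k: round(v, 3) for k, v in m.items()})
# {'precision': 0.03, 'recall': 0.977, 'reject_rank_rate': 0.007}
```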

Headline

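Overall, when the LLM said advance (Strong or Lean), the applicant was actually ranked by that stream 3.0% of the time (251 of 8,340 advance recs). Recall, the fraction of actual rankings that the LLM had flagged as advance, is 97.7% (251 of 257). Note that 8,340 of the 9,143 counted recs (91%) were advance, so high recall follows largely from how rarely the LLM said reject.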

Confusion matrix (LLM rec × actually ranked)

The advance rows show how often an LLM 'advance' coincided with the stream actually ranking the applicant. The off-diagonal cells (advance but not ranked, reject but ranked) are where the LLM and the stream disagreed.

Per-stream metrics

One row per stream; the table as shown lists streams in descending order of advance recs. Low precision for a stream means either (a) the LLM is over-issuing advances to candidates the stream doesn't pick, or (b) the stream's selection criterion differs sharply from what the LLM is encoding.
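A per-stream breakdown of this kind can be produced with a simple group-by. The sketch below is an assumption about the shape of the computation, not the analysis code itself: the input tuple layout and the toy streams X and Y are hypothetical. It returns None where a rate is undefined, which corresponds to the dash cells in the per-stream table.

```python
from collections import defaultdict


def per_stream_metrics(pairs):
    """pairs: iterable of (stream, rec, ranked) tuples, where rec is one of
    the four labels and ranked is a bool. Returns {stream: metrics dict}."""
    tally = defaultdict(lambda: {"adv": 0, "adv_ranked": 0, "rej": 0, "rej_ranked": 0})
    for stream, rec, ranked in pairs:
        side = "adv" if rec.endswith("advance") else "rej"
        tally[stream][side] += 1
        if ranked:
            tally[stream][side + "_ranked"] += 1

    out = {}
    for stream, t in tally.items():
        n_ranked = t["adv_ranked"] + t["rej_ranked"]
        out[stream] = {
            "n_advance": t["adv"],
            "n_reject": t["rej"],
            "n_ranked": n_ranked,
            # Each rate is None when its denominator is zero.
            "precision": t["adv_ranked"] / t["adv"] if t["adv"] else None,
            "recall": t["adv_ranked"] / n_ranked if n_ranked else None,
            "reject_rank_rate": t["rej_ranked"] / t["rej"] if t["rej"] else None,
        }
    return out


# Toy data: stream X ranks one of its advances; stream Y ranks nobody
# and has no reject recs, so two of its rates are undefined.
toy = [
    ("X", "Strong advance", True),
    ("X", "Lean advance", False),
    ("X", "Lean reject", False),
    ("Y", "Lean advance", False),
    ("Y", "Lean advance", False),
]
metrics = per_stream_metrics(toy)
```

Here stream Y mirrors rows such as AI Futures Project (no ranked applicants, so recall is undefined) and Mauricio Baker (no reject recs, so the reject rank rate is undefined).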

Precision vs. volume

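Each stream is placed by its number of advance recs (volume) and its advance precision. Bottom-left: streams where the LLM issues few advance recs and even those are rarely ranked. Top-right: streams with many advance recs and high precision, i.e. the LLM tracks them well.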

Counts of (applicant, stream) pairs by LLM rec and ranking outcome:

LLM rec	Not ranked	Ranked	Total
Strong advance	1936	122	2058
Lean advance	6153	129	6282
Lean reject	795	6	801
Strong reject	2	0	2
Total	8886	257	9143

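Per-stream metrics. A dash in a cell marks a rate whose denominator is zero (no ranked applicants, or no reject recs):

Stream	# advance	# reject	# ranked	Advance precision	Advance recall	Reject rank rate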
Lee Sharkey	385	11	5	0.01	1.00	0.00
Mauricio Baker	358	0	3	0.01	1.00	—
Krishnamurthy Dvijotham (Dj)	358	4	3	0.01	1.00	0.00
Arthur Conmy	329	4	7	0.02	1.00	0.00
David Lindner	299	7	2	0.01	1.00	0.00
Redwood Research	295	19	4	0.01	1.00	0.00
Dan Mossing	288	8	4	0.01	1.00	0.00
Tomek Korbak	266	1	5	0.02	1.00	0.00
UKAISI Red-Team	249	10	10	0.04	1.00	0.00
AI Futures Project	245	6	0	0.00	—	0.00
Paul Riechers, Adam Shai	240	1	6	0.03	1.00	0.00
Michael Chen	237	15	0	0.00	—	0.00
Alan Cooney	223	0	6	0.03	1.00	—
Adrià Garriga-Alonso	222	9	0	0.00	—	0.00
Dan Murfet, Jesse Hoogland	216	29	7	0.03	1.00	0.00
Oliver Sourbut	216	16	5	0.02	1.00	0.00
Team Shard	216	1	12	0.06	1.00	0.00
Victoria Krakovna	211	16	7	0.03	1.00	0.00
Mary Phuong	205	17	6	0.03	1.00	0.00
Epoch AI	192	6	1	0.01	1.00	0.00
Sarah Schwettmann, Jacob Steinhardt	181	2	2	0.01	1.00	0.00
Maksym Andriushchenko	177	0	10	0.06	1.00	—
Alignment Research Center (ARC)	177	10	6	0.03	1.00	0.00
Marius Hobbhahn	175	2	10	0.06	1.00	0.00
Jacob Merizian	174	0	9	0.05	1.00	—
Daniel Kang	174	3	0	0.00	—	0.00
He He	161	6	5	0.03	1.00	0.00
Jeff Alstott	152	11	25	0.14	0.88	0.27
Roger Grosse	152	72	8	0.05	1.00	0.00
Stephen Casper (Cas)	146	1	6	0.04	1.00	0.00
LawZero	146	5	3	0.02	1.00	0.00
Megan Kinniment	138	0	2	0.01	1.00	—
Cristian Trout	134	32	4	0.03	1.00	0.00
Shi Feng	122	0	5	0.04	1.00	—
Patrick Butlin	114	19	2	0.02	1.00	0.00
Safe AI Forum	106	3	5	0.05	1.00	0.00
Alexis Carlier, Zainab Ali Majid	91	11	3	0.03	1.00	0.00
Matthew Gentzel	86	21	8	0.09	1.00	0.00
Neev Parikh	81	1	9	0.11	1.00	0.00
Richard Ngo	80	0	13	0.16	1.00	—
Janet Egan	69	60	4	0.04	0.75	0.02
Keri Warr	62	20	2	0.03	1.00	0.00
Peter Henderson	53	242	5	0.06	0.60	0.01
Abram Demski	49	11	13	0.27	1.00	0.00
Forethought	48	82	0	0.00	—	0.00
Gabriel Kulp	42	9	5	0.12	1.00	0.00
