B5 — Policy/gov pipeline

Context

The empirical pipeline gets most of the attention, but in cohort 10.0 ~445 applicants applied through Policy & Strategy (P&S) and ~250 through Technical Governance (TG). These tracks use a different Stage-2 rubric: Research Skills is paired with Analytical Communication (instead of Technical Execution), and Research Skills gets two relevance multipliers (policy relevance and technical relevance) rather than one. Does the policy/gov rubric work?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
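The composite formula above can be written out as a small function. This is a minimal sketch, not the pipeline's actual code: the function and argument names are hypothetical, and it assumes that all sub-scores share one scale and that the relevance multiplier is applied to Research Skills before the 0.50 weight.

```python
# Relevance multipliers as stated in the text.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance):
    """Composite = 0.50*RS_effective + 0.35*TE + 0.15*SS.

    Assumption: the multiplier scales RS before weighting.
    """
    effective_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * effective_rs + 0.35 * te + 0.15 * ss
```

Under these assumptions, an applicant scoring 4 on every attribute gets a composite of 4.0 when tagged Direct and 3.7 when tagged Adjacent, so the multiplier moves the composite by at most half of its effect on Research Skills.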

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
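The three outcome flags can be derived from per-stream engagement records roughly as follows. This is an illustrative sketch: the `StreamEngagement` record and its field names are hypothetical and do not reflect the real data schema.

```python
from dataclasses import dataclass


@dataclass
class StreamEngagement:
    """One applicant's interaction with one stream (hypothetical schema)."""
    worktest_invite: bool = False
    interview_invite: bool = False
    megastream_takehome: bool = False
    ranked: bool = False
    offered: bool = False
    waitlisted: bool = False


def is_ranked(engagements):
    """Primary outcome: ranked by at least one stream."""
    return any(e.ranked for e in engagements)


def is_invited_to_worktest(engagements):
    """Secondary outcome: engaged by at least one stream in any way."""
    return any(
        e.worktest_invite or e.interview_invite or e.megastream_takehome or e.ranked
        for e in engagements
    )


def passed_mentors_bar(engagements):
    """Offered or waitlisted by at least one stream."""
    return any(e.offered or e.waitlisted for e in engagements)
```

Because `ranked` is one of the conditions inside `is_invited_to_worktest`, every ranked applicant is also counted as invited, which matches the superset relationship described above.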

Policy/gov rubric details

Research Skills (P/G) — same 0–4 tier as empirical, but graded with policy-relevant criteria.
Analytical Communication — the policy/gov analog of empirical Technical Execution. Measures clarity of argument, structured analysis, and policy writing quality.
Soft Skills — same construct as empirical.
Dual relevance multipliers — both Policy and Technical Governance composites use Research Skills, but each applies a different relevance multiplier so the same applicant can have different effective RS scores across P&S vs TG views.
Some Stage-3 applicants took a writing-sample work test (n=82 completed).
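To illustrate the dual relevance multipliers, the sketch below computes the effective Research Skills score under each view. Two assumptions are made here that the text does not confirm: the P&S view uses the policy-relevance tag and the TG view uses the technical-relevance tag, and both reuse the empirical multiplier values (Direct=1.0 / Adjacent=0.85 / Distant=0.60). The function name is hypothetical.

```python
# Assumption: same multiplier values as the empirical track.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def effective_rs_views(rs_tier, policy_relevance, technical_relevance):
    """Return the effective Research Skills score in the P&S and TG views.

    rs_tier is the 0-4 Research Skills (P/G) tier; each view scales it by
    its own relevance tag, so one applicant can get two different values.
    """
    return {
        "ps": rs_tier * RELEVANCE_MULTIPLIER[policy_relevance],
        "tg": rs_tier * RELEVANCE_MULTIPLIER[technical_relevance],
    }
```

For example, an applicant with an RS tier of 3 who is Direct on policy relevance but Distant on technical relevance would carry an effective RS of 3.0 in the P&S view and about 1.8 in the TG view.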

Headline

P&S composite → is_ranked (P&S applicants only, n=445): AUC = 0.706 [0.584, 0.808]. TG composite → is_ranked (TG applicants only, n=373): AUC = 0.752 [0.645, 0.851].

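Both composites discriminate above chance: the lower bounds of both 95% CIs are above 0.5. Compared to the empirical composite (A1 full-pool AUC ≈ 0.82), the policy/gov point estimates are somewhat lower but of similar magnitude, and with only 30 and 25 ranked applicants their intervals are wide.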
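For readers who want to reproduce figures of this kind, here is a minimal, dependency-free sketch of an AUC with a bracketed interval. The text does not say how the 95% CIs were computed; a percentile bootstrap over applicants is assumed here, and the function names, `n_boot`, and `seed` are illustrative choices.

```python
import random


def auc(scores, labels):
    """Mann-Whitney AUC: P(positive outscores negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if not any(ys) or all(ys):
            continue  # resample is missing one class; draw again
        stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

With few positives (n_pos of 25 to 30 here), each resample contains a noticeably different set of ranked applicants, which is one reason intervals at this sample size come out wide.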

AUC summary

Sample	Predictor	n	n_pos	AUC	95% CI
P&S applicants	`ps_composite`	445	30	0.706	[0.584, 0.808]
TG applicants	`tg_composite`	373	25	0.752	[0.645, 0.851]
Combined P&S+TG	`ps_composite`	584	39	0.724	[0.633, 0.810]
Policy review sample (n≈424)	`Research Skills (P/G rating)`	423	32	0.790	[0.730, 0.845]
Policy review sample (n≈424)	`Analytical Communication`	423	32	0.664	[0.573, 0.742]
Policy review sample (n≈424)	`Soft Skills (P/G)`	423	32	0.607	[0.517, 0.693]

Attribute distributions (P/G review sample, n≈424)

P&S composite by ranked status

Dual relevance tag distributions

Most applicants are tagged Direct or Adjacent on at least one relevance axis. Few are tagged Distant on both, which is consistent with the relevance system working as designed.

Writing-sample work test

Among the 82 policy/gov applicants who completed the work test, work-test score → is_ranked AUC = 0.533 [0.378, 0.685]. The sample is small and the interval spans 0.5, so the work-test score is not distinguishable from chance as a predictor of being ranked. The grading tier (Exceeds/Meets/Near/Below) tracks scores monotonically, so it is usable as a categorical summary.
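One way to check that the grading tier tracks scores monotonically is to compare mean work-test scores across tiers in order. The sketch below assumes "monotonically" means strictly increasing mean score from Below to Exceeds; the record layout and function names are hypothetical.

```python
from statistics import mean

# Tiers from lowest to highest, as named in the text.
TIER_ORDER = ["Below", "Near", "Meets", "Exceeds"]


def mean_score_by_tier(records):
    """records: iterable of (tier, score) pairs. Returns means in tier order."""
    by_tier = {}
    for tier, score in records:
        by_tier.setdefault(tier, []).append(score)
    return {t: mean(by_tier[t]) for t in TIER_ORDER if t in by_tier}


def tiers_track_scores(records):
    """True if mean score strictly increases from each tier to the next."""
    means = list(mean_score_by_tier(records).values())
    return all(a < b for a, b in zip(means, means[1:]))
```

A tier that passes this check preserves the ordering of the underlying score while discarding within-tier variation.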

Writing-sample rescue rate

In the policy review sample, 91 applicants had a low Research Skills tier (≤2). None of them were ranked, so the rescue rate (low-RS → ranked) is 0.0% (0/91).
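The rescue-rate calculation, plus a bound on how large the true rate could plausibly be when zero rescues are observed, can be sketched as follows. The function names are hypothetical, and the upper bound is an addition for illustration rather than something reported in the analysis: it is the exact one-sided binomial bound for zero events in n trials.

```python
def rescue_rate(rs_tiers, ranked, low_cutoff=2):
    """Count low-RS applicants (tier <= low_cutoff) and how many were ranked.

    Returns (n_rescued, n_low, rate); rate is NaN if nobody is low-RS.
    """
    low = [bool(r) for t, r in zip(rs_tiers, ranked) if t <= low_cutoff]
    n_low = len(low)
    n_rescued = sum(low)
    rate = n_rescued / n_low if n_low else float("nan")
    return n_rescued, n_low, rate


def zero_event_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound on a proportion when 0 of n events occur.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)
```

For 0 of 91, `zero_event_upper_bound(91)` is roughly 0.032, so under this assumption the data are compatible with a true rescue rate of up to about 3%, not only with exactly zero.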
