B5 — Does the policy/governance rubric work?

Context

The empirical pipeline gets most of the attention, but 445 applicants in cohort 10.0 applied through Policy & Strategy (P&S) and 373 through Technical Governance (TG); some applied to both, for 584 combined. These tracks use a different Stage-2 rubric: Research Skills is paired with Analytical Communication (instead of Technical Execution), and Research Skills gets two relevance multipliers (policy relevance and technical relevance) rather than one. This analysis asks whether the policy/gov rubric predicts which applicants get ranked.

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
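
As a worked illustration of the formula above, here is a minimal sketch. The weights and multiplier values come from the text; the function names, argument names, and the choice to apply the multiplier to RS before weighting are assumptions for illustration, not the pipeline's actual code.

```python
# Weights and multipliers are taken from the text above; all identifiers
# here are hypothetical and chosen only for this example.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance):
    """0.50*RS + 0.35*TE + 0.15*SS, with the relevance multiplier on RS.

    Assumption: the multiplier scales RS before the 0.50 weight is applied.
    """
    adjusted_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


# A hypothetical applicant scoring 4 on every attribute, under each tag.
for tag in ("Direct", "Adjacent", "Distant"):
    print(tag, round(empirical_composite(4, 4, 4, 4, 4, tag), 2))
```

For this hypothetical applicant the composite falls from 4.0 (Direct) to 3.7 (Adjacent) to 3.2 (Distant), so under these assumptions the relevance tag alone can move the composite by up to 20%.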

Outcome definitions

Outcome definitions used throughout these analyses:

Policy/gov rubric details

Headline

P&S composite → is_ranked (P&S applicants only, n=445): AUC = 0.706 [0.584, 0.808]. TG composite → is_ranked (TG applicants only, n=373): AUC = 0.752 [0.645, 0.851].

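Both composites discriminate above chance: each 95% CI excludes 0.5. Their point estimates are lower than the empirical composite's (A1 full-pool AUC ≈ 0.82), but with only 30 and 25 ranked applicants the intervals are wide; 0.82 falls inside the TG interval and just above the upper end of the P&S interval.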

AUC summary

| Sample | Predictor | n | n_pos | AUC | 95% CI |
|---|---|---|---|---|---|
| P&S applicants | ps_composite | 445 | 30 | 0.706 | [0.584, 0.808] |
| TG applicants | tg_composite | 373 | 25 | 0.752 | [0.645, 0.851] |
| Combined P&S+TG | ps_composite | 584 | 39 | 0.724 | [0.633, 0.810] |
| Policy review sample (n≈424) | Research Skills (P/G rating) | 423 | 32 | 0.790 | [0.730, 0.845] |
| Policy review sample (n≈424) | Analytical Communication | 423 | 32 | 0.664 | [0.573, 0.742] |
| Policy review sample (n≈424) | Soft Skills (P/G) | 423 | 32 | 0.607 | [0.517, 0.693] |
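
For readers who want to reproduce figures of this kind, the sketch below computes an AUC and a confidence interval using only the standard library. This write-up does not state how its intervals were obtained; the percentile bootstrap used here, along with every function and parameter name, is an assumption for illustration.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap interval (an assumed
    method). Applicants are resampled with replacement."""
    point = auc(scores, labels)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

Here `scores` would hold a composite such as ps_composite and `labels` the 0/1 is_ranked outcome. With few positives (25 to 39 in the table above), resampled AUCs vary a lot, which is one reason intervals of this width are expected.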

Attribute distributions (P/G review sample, n≈424)

P&S composite by ranked status

Dual relevance tag distributions

Policy relevance (P&S applicants):

Technical relevance (TG applicants):

Most applicants are tagged Direct or Adjacent on at least one relevance axis. Few are tagged Distant on both, so the two axes together rarely leave an applicant without a relevant match.

Writing-sample work test

Among 82 policy/gov applicants who completed the work test, work-test score → is_ranked AUC = 0.533 [0.378, 0.685]. The interval is wide and includes 0.5, so at this sample size the score is not distinguishable from chance. The grading tier (Exceeds/Meets/Near/Below) tracks scores monotonically, which makes it a usable categorical summary of the score.

Writing-sample rescue rate

In the policy review sample, 91 applicants had a low Research Skills tier (≤2). None of them were ranked, so the rescue rate (low-RS → ranked) is 0.0% (0/91).
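
An observed rate of 0 out of 91 does not by itself show that rescue never happens. As a hedged sketch, the exact one-sided upper bound for a zero-event proportion can be computed directly; the function names are illustrative and this calculation is not part of the original analysis.

```python
def rescue_rate(n_low_rs, n_ranked):
    """Share of low-RS applicants who were nonetheless ranked."""
    return n_ranked / n_low_rs


def zero_event_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound on a proportion when 0 of n events are
    observed: the p that solves (1 - p)**n == 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


print(rescue_rate(91, 0))                      # 0.0
print(round(zero_event_upper_bound(91), 3))    # 0.032
```

With n=91, the data are compatible with a true rescue rate of up to roughly 3% at the 95% level, close to the familiar "rule of three" approximation 3/n.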

Takeaways

  1. The policy/gov rubric works: both the P&S and TG composites discriminate above chance for ranking (AUC 0.706 and 0.752). Both point estimates sit below the empirical composite's ≈ 0.82, though the intervals are wide.
  2. Analytical Communication carries some signal, but less than RS: its univariate AUC in the policy review sample is 0.664, versus 0.790 for RS. A univariate AUC cannot show whether it captures signal RS doesn't; that would need a joint model.
  3. Dual relevance multipliers behave as intended: most applicants are tagged Direct or Adjacent on at least one axis, and few are tagged Distant on both.
  4. Writing-sample work test data is sparse (n=82) — wide CIs, no strong conclusions. Worth keeping for now and revisiting after 11.0 collects more.
  5. For 11.0: the policy/gov rubric is broadly fine. The biggest open question is whether to keep the writing-sample work test as a Stage-3 step given its small sample and modest signal-to-cost ratio.
🔧 Debug: how the data was interpreted (safe to skip)

Sample. P&S track applicants: n=445. TG track applicants: n=373. Combined: n=584. Policy/gov-review sample (anyone with a non-null P/G RS score): n=423.

Outcome variable(s). is_ranked (any stream).

Predictor fields. [stage-2-policy-gov-review] Policy & Strategy composite score, [stage-3-policy-gov] Technical Governance composite score, and the underlying tier ratings (RS, A&C, SS). Work-test overall score for the n=82 takers.

Filters applied. Canonical dedup. Track masks from [stage-1-track] Selected tracks.

Missing-data handling. Per-predictor listwise drop.
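
One reading of "per-predictor listwise drop" is that each AUC is computed only on rows where that predictor and the outcome are both present, so n can differ between predictors. The sketch below illustrates that reading; the helper name and the example rows are hypothetical, while the field names follow the ones used in this write-up.

```python
def complete_cases(rows, predictor, outcome="is_ranked"):
    """Keep rows where both the predictor and the outcome are present.

    Applied separately for each predictor, so a row missing tg_composite
    can still contribute to the ps_composite analysis.
    """
    return [r for r in rows
            if r.get(predictor) is not None and r.get(outcome) is not None]


# Hypothetical rows, for illustration only.
rows = [
    {"ps_composite": 3.1, "tg_composite": None, "is_ranked": 0},
    {"ps_composite": 3.8, "tg_composite": 3.5, "is_ranked": 1},
    {"ps_composite": None, "tg_composite": 2.9, "is_ranked": 0},
]
ps_rows = complete_cases(rows, "ps_composite")
tg_rows = complete_cases(rows, "tg_composite")
```

Under this reading, the per-predictor n values in the AUC summary need not match each other, which is consistent with the table reporting a separate n for each row.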

Key assumptions / caveats.