A3 — Are the composite weights right?

Context

The 10.0 composite weights are 0.50 Research Skills + 0.35 Technical Execution + 0.15 Soft Skills, with TE split internally as 0.50 MLE + 0.30 SWE + 0.20 Math. These weights were chosen by judgment, not by fitting to data. This analysis turns the question around: if we let the data choose the weights, by fitting a model that predicts ranking from the five attribute sub-scores, what weights would it produce, and how far are they from the current ones?

A second question concerns the Research Skills relevance multipliers (Direct=1.0, Adjacent=0.85, Distant=0.60), which discount RS according to how well the applicant's experience matches the streams they applied to. Are these the right multipliers?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

  1. Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
  2. Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
  3. Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
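The formula above can be sketched in a few lines of Python. The weights and multipliers are taken from the text; the function names, the argument names, and the example scores are illustrative assumptions, since the rubric's actual score scale is not stated here.

```python
# Weights and multipliers as stated in the text.
# Function and variable names are illustrative, not from the pipeline.
TOP_WEIGHTS = {"RS": 0.50, "TE": 0.35, "SS": 0.15}
TE_WEIGHTS = {"MLE": 0.50, "SWE": 0.30, "Math": 0.20}
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def composite(rs, mle, swe, math_score, ss, relevance):
    """Composite = 0.50*(RS * multiplier) + 0.35*TE + 0.15*SS."""
    te = (TE_WEIGHTS["MLE"] * mle
          + TE_WEIGHTS["SWE"] * swe
          + TE_WEIGHTS["Math"] * math_score)
    return (TOP_WEIGHTS["RS"] * rs * RELEVANCE[relevance]
            + TOP_WEIGHTS["TE"] * te
            + TOP_WEIGHTS["SS"] * ss)


def flattened_weights():
    """Effective per-attribute weights after expanding the TE split."""
    flat = {"RS": TOP_WEIGHTS["RS"], "SS": TOP_WEIGHTS["SS"]}
    for name, weight in TE_WEIGHTS.items():
        flat[name] = TOP_WEIGHTS["TE"] * weight
    return flat


# Hypothetical applicant with every sub-score equal to 4.0.
print(composite(4.0, 4.0, 4.0, 4.0, 4.0, "Direct"))
print(composite(4.0, 4.0, 4.0, 4.0, 4.0, "Distant"))
print(flattened_weights())
```

Expanding the TE split gives effective weights of 0.175 (MLE), 0.105 (SWE), and 0.070 (Math), which is how the five per-attribute "Current" weights are obtained from the three-part formula.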

Outcome definitions

Outcome definitions used throughout these analyses:

Question

Are the current 50/35/15 weighting, the 50/30/20 split within TE, and the 1.0/0.85/0.60 RS relevance multipliers empirically optimal?

Empirical weights vs. current

Attribute      Current   Empirical (logistic)   Equal   Raw coef
RS·relevance   0.500     0.571                  0.200   +1.275
TE (MLE)       0.175     0.000                  0.200   -0.195
TE (SWE)       0.105     0.148                  0.200   +0.330
TE (Math)      0.070     0.234                  0.200   +0.522
Soft skills    0.150     0.047                  0.200   +0.105

The "Empirical (logistic)" column is the logistic-regression coefficient with negative values clipped to zero, then normalized so the column sums to 1. A larger value means the attribute contributes more to predicting whether an applicant is ranked.

Read:

- RS·relevance carries roughly 0.57 of total weight empirically vs 0.50 in the current composite.
- MLE empirical weight: 0.00 vs current 0.175.
- SS empirical weight: 0.05 vs current 0.15.
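The clip-and-normalize step behind the empirical weights can be reproduced from the raw coefficients in the table. This is a minimal sketch: the coefficient values are copied from the "Raw coef" column, while the function name is illustrative. It also assumes the five predictors were on comparable scales when the model was fit, which the text does not state; coefficient magnitudes are only comparable as weights under that assumption.

```python
# Raw logistic coefficients copied from the table above.
raw_coefs = {
    "RS·relevance": 1.275,
    "TE (MLE)": -0.195,
    "TE (SWE)": 0.330,
    "TE (Math)": 0.522,
    "Soft skills": 0.105,
}


def clip_and_normalize(coefs):
    """Clip negative coefficients to zero, then rescale to sum to 1."""
    clipped = {name: max(value, 0.0) for name, value in coefs.items()}
    total = sum(clipped.values())
    if total == 0:
        raise ValueError("all coefficients are non-positive")
    return {name: value / total for name, value in clipped.items()}


weights = clip_and_normalize(raw_coefs)
for name, weight in weights.items():
    print(f"{name:14s} {weight:.3f}")
```

Clipping matters for the reading of the table: MLE's weight of 0.000 reflects a negative coefficient, not a coefficient that was estimated to be exactly zero.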

AUC by scheme

Scheme                                             AUC     95% CI
Current (0.50·RS + 0.35·TE + 0.15·SS, TE-split)    0.674   [0.598, 0.742]
Empirical (logistic, raw-coef normalized)          0.710   [0.638, 0.776]
Equal weights (0.2 × 5)                            0.675   [0.604, 0.741]
Full logistic (probability)                        0.709   [0.639, 0.776]

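The empirical scheme beats the current scheme by about 0.036 AUC (0.710 vs 0.674), but the 95% CIs overlap heavily, and equal weights score 0.675, essentially the same as the current scheme. That pattern suggests the current weights sit in a fairly flat region of the optimization landscape, where small tweaks are unlikely to matter. Practically, this is good news for keeping the existing rubric stable.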
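For readers who want to reproduce this kind of comparison, here is a minimal sketch of an AUC with a percentile-bootstrap interval, using only the standard library. The text does not say how the reported CIs were computed, so the bootstrap is an assumption, and the toy scores and labels are made up for illustration.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI; resamples applicants with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples with only one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: composite scores and is_ranked labels (hypothetical).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
print(auc(toy_scores, toy_labels))
print(bootstrap_ci(toy_scores, toy_labels, n_boot=200))
```

Checking whether two intervals overlap is a conservative way to compare schemes. Resampling the same applicants for both schemes and taking the AUC difference within each resample would test the gap more directly.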

RS relevance multiplier grid

Holding TE and SS weights fixed, sweep the Adjacent multiplier (0.7–1.0) and Distant multiplier (0.4–0.7) and recompute composite + AUC.

The quantity of interest is the gap between the best AUC on the grid and the AUC at the current multipliers. If the gap is <0.01, the current multipliers are essentially optimal. If it is larger, consider what the grid maximum implies: a higher Adjacent multiplier means "Adjacent relevance is closer to Direct than we currently say", and a lower Distant multiplier means "Distant relevance should count less than it does now."
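The sweep described above can be sketched as follows. The 0.05 grid step, the row layout, and the toy applicants are assumptions made for illustration; only the weights, the multiplier ranges, and the current multiplier values come from the text. TE is taken as an already-combined score.

```python
def auc(scores, labels):
    """Pairwise AUC; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def composite(row, multipliers):
    """0.50*(RS * multiplier) + 0.35*TE + 0.15*SS, with TE and SS weights fixed."""
    return (0.50 * row["rs"] * multipliers[row["relevance"]]
            + 0.35 * row["te"]
            + 0.15 * row["ss"])


def sweep(rows, labels, adjacent_grid, distant_grid):
    """AUC for every (Adjacent, Distant) pair; Direct stays at 1.0."""
    results = {}
    for adj in adjacent_grid:
        for dist in distant_grid:
            m = {"Direct": 1.0, "Adjacent": adj, "Distant": dist}
            scores = [composite(row, m) for row in rows]
            results[(adj, dist)] = auc(scores, labels)
    return results


# Assumed grid step of 0.05 over the ranges given in the text.
adjacent_grid = [round(0.70 + 0.05 * i, 2) for i in range(7)]  # 0.70 to 1.00
distant_grid = [round(0.40 + 0.05 * i, 2) for i in range(7)]   # 0.40 to 0.70

# Hypothetical applicants; labels stand in for is_ranked.
rows = [
    {"rs": 4, "te": 3, "ss": 3, "relevance": "Direct"},
    {"rs": 5, "te": 2, "ss": 3, "relevance": "Distant"},
    {"rs": 3, "te": 4, "ss": 4, "relevance": "Adjacent"},
    {"rs": 2, "te": 3, "ss": 2, "relevance": "Direct"},
    {"rs": 4, "te": 4, "ss": 3, "relevance": "Adjacent"},
    {"rs": 3, "te": 2, "ss": 4, "relevance": "Distant"},
]
labels = [1, 1, 0, 0, 1, 0]

results = sweep(rows, labels, adjacent_grid, distant_grid)
best_pair = max(results, key=results.get)
gap = results[best_pair] - results[(0.85, 0.6)]
print(best_pair, results[best_pair], gap)
```

Because the grid maximum is chosen on the same data used to score it, the gap is an optimistic estimate of what a multiplier change would deliver on a new cohort.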

Takeaways

  1. Empirically, Research Skills (with relevance) carries the most weight by far (~57% vs the current 50%), followed by Math. The other attributes do less work than the current weights assume.
  2. MLE's empirical weight is at floor — its raw logistic coefficient came back negative (clipped to zero in the display). This is one of the more striking results: the rubric's MLE component is not helping the model predict who gets ranked, at least at Stage 3. Could be multicollinearity with SWE/Math, could be a real signal that MLE-heavy reviews don't track what streams care about. Worth investigating.
  3. The 50/35/15 split is in a reasonable region — re-weighting gives a small AUC bump (~+0.04). The current weights are not optimal but they're not far off.
  4. The relevance multipliers (1.0 / 0.85 / 0.60) are essentially optimal — the grid maximum is within +0.01 AUC of the current values. No reason to change them.

🔧 Debug: how the data was interpreted (safe to skip)

Sample. Stage-3 empirical pool, listwise complete on all 5 attribute tiers (n=480, ranked n=65). Same definition as A1/A2 Stage 3 view.

Outcome variable(s). is_ranked (ranked by ≥1 stream).

Predictor fields. Five attribute scores per applicant: RS × relevance multiplier, MLE, SWE, Math, SS. All read as numeric. RS multiplier uses the project-canonical {Direct: 1.0, Adjacent: 0.85, Distant: 0.6}.

Filters applied. Stage-3 empirical + listwise-complete on attributes. Special advances kept (they're real Stage-3 applicants). Nanda not excluded (pool-level).

Missing-data handling. Listwise drop on the 5 attribute columns.

Key assumptions / caveats.