Part B — 10.0 process & design questions

Context

Part A asked whether the central 10.0 selection rubric works (it broadly does). Part B steps through a series of process and design questions that came up during 10.0, each one informing a specific 11.0 decision. It comprises nine analyses, mostly descriptive, with regression where the data supports it.

How the 10.0 selection pipeline worked (refresher)

~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background, picked tracks, took an LLM-graded screen. Some applicants were filtered out here.
Stage 2 — LLM-graded rubric assigns a composite score; top ~600 go to Stage 3.
Stage 3 — applicants choose specific streams (mentor-led projects) to apply to. Streams rank applicants; ~120 offers go out.

For the empirical track, composite = 0.50·RS + 0.35·TE + 0.15·SS, where TE (Technical Execution) = 0.50·MLE + 0.30·SWE + 0.20·Math and RS is scaled by a relevance multiplier (1.0 / 0.85 / 0.6 for Direct / Adjacent / Distant).
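
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function and variable names are hypothetical, the score scale (8 is used below) is only an example, and it assumes the relevance multiplier scales RS before the 0.50 weight is applied.

```python
# Weights and multipliers are taken from the text above; all names are hypothetical.
RELEVANCE_MULTIPLIER = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.6}


def technical_execution(mle: float, swe: float, math_score: float) -> float:
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs: float, mle: float, swe: float,
                        math_score: float, ss: float, relevance: str) -> float:
    """Composite = 0.50*RS' + 0.35*TE + 0.15*SS, with RS' = RS * multiplier."""
    # Assumption: the multiplier scales RS before the 0.50 weight is applied.
    adjusted_rs = rs * RELEVANCE_MULTIPLIER[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss


if __name__ == "__main__":
    # Same sub-scores at each relevance level, to isolate the multiplier's effect.
    for relevance in RELEVANCE_MULTIPLIER:
        print(relevance, round(empirical_composite(8, 8, 8, 8, 8, relevance), 2))
```

Under these assumptions, an applicant with identical sub-scores loses 0.50 × (1 − multiplier) × RS of composite when their relevance is Adjacent or Distant rather than Direct.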

Headline findings

The funnel is dominated by Stage 2 → Stage 3 filtering. There were 2,203 canonical applications; ~60% of those who passed Stage 1 were dropped at Stage 2. By the end, 189 (8.6% of all applications) were ranked and 126 (5.7%) received offers. B1 results.
~73% of committed applicants matched with a top-3 preference stream. A substantial minority of applicants reach matches with streams they didn't initially flag at Stage 1 — supporting making Stage-1 stream questions optional / low-friction for 11.0. B2 results.
ToC alignment score is the strongest of the mission-alignment signals. AUC = 0.65 in the full pool, attenuating to ~0.61 in Stage 3. Multi-select AIS-engagement count and free-text duration carry less signal. (Important caveat: the 10.0 AIS form had a bug that swapped the secondary detail panels. The main multi-select and the duration field are unaffected, and this analysis uses only those two fields.) B3 results.
"AI safety org" references carry the strongest reference-type signal (+12.4% lift in P(ranked) vs applicants without one). Government / policy / other-industry refs skew negative. B4 results.
The policy/gov rubric works. P&S composite AUC = 0.71, TG composite AUC = 0.75. Analytical Communication is doing real work as the policy-track analog of empirical's Technical Execution. Dual relevance multipliers pull weight. B5 results.
Composite has a strong floor in Stage 3. The bottom several deciles of Stage-3 composite rank at near-zero rates. The curve is concave with diminishing returns above the top decile. Supports using composite as a hard floor rather than just an informational signal. B6 results.
AI-detected text in applications is a modest negative signal. An applicant's maximum fraction_ai predicts not being ranked with AUC ~0.57. Composite is lower for high-fraction_ai applicants, suggesting reviewers are implicitly (or explicitly) discounting AI-written content. B7 results.
Returning applicants outperform first-timers — but mostly via clearing earlier stages. 86/567 returning applicants got ranked (15.2%) vs 102/1633 first-timers (6.2%). Conditional on reaching Stage 3, the gap narrows substantially. B8 results.
"Applying to more streams" is mostly a quality proxy. Stream count correlates +0.43 with composite. After controlling for composite in a joint logit, the marginal effect of applying to more streams is small. No strong case for capping #streams in 11.0. B9 results.

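Several of the findings above are reported as AUC. As a reading aid, here is a minimal sketch of what that number means: the probability that a randomly chosen ranked applicant has a higher score than a randomly chosen unranked one, with ties counted as half. The function name and the toy data are hypothetical (not 10.0 data), and this is not necessarily how the individual reports computed it.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs the score orders correctly.

    labels: 1 = ranked, 0 = not ranked. Ties between a positive and a
    negative count as half a correct ordering.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy values for illustration only.
    toy_scores = [0.9, 0.8, 0.4, 0.3]
    toy_labels = [1, 0, 1, 0]
    print(pairwise_auc(toy_scores, toy_labels))
```

On this reading, 0.5 is chance and 1.0 is perfect separation, so an AUC of 0.57 is a weak signal and 0.71 to 0.75 a moderately strong one.
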
11.0 implications (tentative)

Make Stage-1 stream questions optional / low-friction across the board (B2 + B6). Applicants and streams find each other late in the process; the early stream-selection is more burden than value for most.
Treat composite as a hard floor, not just an informational signal (B6). A floor around the 30–40th percentile of the Stage-3 sub-pool would cut review effort that yields almost no ranked applicants, without dropping anyone with a real chance.
Keep the ToC alignment question (B3). It's a cheap and effective screen. Consider replacing the flat AIS-engagement multi-select with something that captures depth.
Make the AI-use policy more explicit (B7). The current implicit penalty seems to work; an explicit policy might both deter use and improve fairness.
Track returning applicants more carefully (B8). Add a "which cohort did you return from?" question and link to prior outcomes — this would help disentangle self-selection from genuine improvement.
Surface the AI-safety-org reference flag to Stage-3 reviewers (B4). It is the strongest single reference-type signal and is cheap to highlight.
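
The composite-floor proposal above can be checked before adoption by replaying a candidate floor against past outcomes. The sketch below is a hypothetical illustration: the record fields (`composite`, `ranked`), the function names, and the toy pool are assumptions, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math


def nearest_rank_percentile(values, pct):
    """Value at the pct-th percentile using the nearest-rank definition."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def apply_floor(applicants, pct):
    """Split applicants into (kept, dropped) at a composite percentile floor."""
    floor = nearest_rank_percentile([a["composite"] for a in applicants], pct)
    kept = [a for a in applicants if a["composite"] >= floor]
    dropped = [a for a in applicants if a["composite"] < floor]
    return floor, kept, dropped


if __name__ == "__main__":
    # Toy pool for illustration only: composites 1..10, top three ranked.
    pool = [{"composite": c, "ranked": c >= 8} for c in range(1, 11)]
    floor, kept, dropped = apply_floor(pool, 30)
    ranked_lost = sum(a["ranked"] for a in dropped)
    print(f"floor={floor} kept={len(kept)} dropped={len(dropped)} "
          f"ranked_lost={ranked_lost}")
```

The quantity to watch is the number of ranked applicants among those dropped: the floor is only safe if that stays at or near zero on the historical Stage-3 pool.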

Individual reports

Analysis	Question	n
B1 — Selection funnel	How many applicants made it to each stage?	2,203 canonical
B2 — Stream matching & self-selection	Are people matching with their top stream picks? Do they end up at streams they didn't initially flag?	108 committers, 1,064 Stage-3
B3 — Mission alignment signals	Do AIS engagement signals predict ranking?	2,206 ToC scores
B4 — Reference signal value	Do references (and their type) predict ranking?	~1,600 with ≥1 ref
B5 — Policy/governance pipeline	Does the policy/gov rubric work?	P&S: 445, TG: ~250
B6 — Percentile vs progression	Floor + diminishing returns of composite percentile	~604 Stage-3 empirical
B7 — Pangram / AI-detection	Did AI-written text in applications hurt outcomes?	~2,200 with Pangram signal
B8 — Returning applicants	Do returning applicants do better?	569 returning, 1,638 first-time
B9 — Application strategy	Does applying to more streams help?	1,064 Stage-3 applicants

Errors encountered during Part B

None unrecovered. One in-flight caveat surfaced: the 10.0 AIS engagement form had a UI bug that swapped the secondary "research program" and "structured course" detail panels (flagged by Sanyu at the B5 checkpoint). The bug affects only the secondary detail fields; B3 uses the main multi-select count and the duration field, both of which are unaffected. The caveat is documented in B3 and saved to project memory for future analyses.