Part D — Convergent & exploratory analyses
Context
Parts A, B, and C dug into the 10.0 process directly and validated findings across cohorts. Part D is more exploratory: convergent validity, within-stream consistency, factor structure, and the relationship between intermediate quality signals (SRP/FRP) and downstream outcomes. It comprises six analyses (D1–D6).
Headline findings
- Signal-agreement is real but doesn't dramatically beat the composite alone. An applicant scoring above-median on 4 of 5 weak signals (composite, CodeSignal, ToC, research-taste, AIS count) is ranked at a much higher rate than one who is above-median on only 0–1. But the agreement-count AUC (~0.68) is only marginally better than the composite-alone AUC. D1 results.
- Streams vary substantially in how closely they track the composite. Median ρ between composite and stream rank is ~0.20; some streams have ρ > 0.5 (closely tracking), others ρ < 0 (using orthogonal criteria). For low-ρ streams, per-attribute heatmaps reveal which attribute(s) the stream actually weights. D2 results.
- Referee categories predict mentor evaluations only weakly, and that signal is mostly absorbed by centralized review scores. AI-safety-org refs show a modest positive lift on mentor composite, but the marginal value over centralized review scores is small (joint R² = 0.18 vs review-only R² = 0.18, identical to two decimal places). D3 results.
- Research-taste Part 1 and Part 2 are substantially redundant (Spearman ρ ≈ 0.64). Simplifying to a single-part test for 11.0 could save applicant time without losing much signal. D4 results.
- Mentor evaluations are essentially a single "overall quality" factor. PC1 explains 60–70% of variance across cohorts; all four sub-dimensions load positively and evenly. Halo effect is real and consistent. Composite ≈ mean ≈ any individual dimension for predictive purposes. D5 results.
- SRP/FRP × mentor-eval correlations are modest (and weaken over cohorts). 7.0: ρ ≈ +0.23; 8.0: +0.11; 9.0: +0.06. The 9.0 FRP rubric is most ambitious but least mentor-aligned — possibly measuring orthogonal program-quality dimensions. SRP/FRP is not a tight proxy for mentor signal. D6 results.
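Several of the findings above (D2, D4, D6) are stated as Spearman rank correlations. As a reminder of what that statistic computes, here is a minimal stdlib-only sketch. The function names and the toy `composite` / `stream_score` values are hypothetical illustrations, not the code or data behind the reported numbers, and the sketch assumes higher values mean "better" in both inputs.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n (ascending), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: five applicants' composite scores and one
# stream's scores for the same applicants (higher = better in both).
composite = [0.91, 0.75, 0.62, 0.40, 0.33]
stream_score = [3, 5, 4, 1, 2]
rho = spearman(composite, stream_score)
print(f"rho = {rho:.2f}")
```

Under this reading, a stream with ρ near zero or below is ordering applicants almost independently of the composite, which is what motivates the per-attribute follow-up described in D2.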
11.0 implications (tentative)
- Show Stage-3 reviewers a simple "weak-signal agreement count" as a secondary view alongside the composite (D1). Cheap signal that captures the convergent validity intuition.
- Talk to low-consistency streams (D2) — find out whether they're picking up signal the composite misses or ranking idiosyncratically. Per-cluster scoring (per A6) is the obvious follow-up.
- Simplify the research-taste test — one part instead of two — based on Part 1 / Part 2 redundancy (D4). Saves applicant time.
- Don't try to design dimension-specific predictive rubrics for mentor evaluations (D5). The halo effect means a single composite is what you're predicting, regardless of which dimension you target.
- Re-evaluate the FRP rubric (D6) — if it's measuring something orthogonal to mentor signal, decide whether that's intentional (it's measuring program impact, not researcher quality) or whether it's drifted unhelpfully.
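The "weak-signal agreement count" proposed in the first implication can be sketched in a few lines. This is an illustrative sketch only: the field names, the `ranked` outcome label, and all values are made up for the example, and "above-median" is assumed to mean strictly above the cohort median. The actual D1 definitions may differ.

```python
from statistics import median

# Hypothetical field names for the five weak signals named in D1.
SIGNALS = ["composite", "codesignal", "toc", "research_taste", "ais_count"]


def agreement_counts(applicants):
    """Per applicant, count how many signals are strictly above the cohort median."""
    medians = {s: median(a[s] for a in applicants) for s in SIGNALS}
    return [sum(a[s] > medians[s] for s in SIGNALS) for a in applicants]


def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy cohort; "ranked" stands in for whatever outcome D1 used.
applicants = [
    {"composite": 0.9, "codesignal": 800, "toc": 4, "research_taste": 7, "ais_count": 3, "ranked": True},
    {"composite": 0.8, "codesignal": 600, "toc": 5, "research_taste": 8, "ais_count": 0, "ranked": True},
    {"composite": 0.4, "codesignal": 780, "toc": 2, "research_taste": 3, "ais_count": 1, "ranked": False},
    {"composite": 0.3, "codesignal": 500, "toc": 1, "research_taste": 2, "ais_count": 0, "ranked": False},
    {"composite": 0.6, "codesignal": 700, "toc": 3, "research_taste": 6, "ais_count": 2, "ranked": True},
    {"composite": 0.5, "codesignal": 650, "toc": 3, "research_taste": 4, "ais_count": 1, "ranked": False},
]

counts = agreement_counts(applicants)
labels = [a["ranked"] for a in applicants]
print("agreement counts:", counts)
print("agreement-count AUC:", auc(counts, labels))
print("composite-alone AUC:", auc([a["composite"] for a in applicants], labels))
```

Showing the count next to the composite, rather than replacing it, matches the D1 finding that agreement adds only a marginal AUC gain. The toy cohort is far too small to say anything about real effect sizes.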
Individual reports
Errors encountered during Part D
No unrecovered errors. One issue was handled in flight: D4's policy/gov "grading tier" field turned out to be a dict containing rationale text rather than a clean tier label, so the analysis switched to comparing overall scores by ranked status instead.