Part D — Convergent & exploratory analyses

Context

Parts A, B, and C dug into the 10.0 process directly and validated findings across cohorts. Part D is more exploratory. Its six analyses cover convergent validity, within-stream consistency, factor structure, and the relationship between intermediate quality signals (SRP/FRP) and downstream outcomes.

Headline findings

Signal agreement is real but doesn't dramatically beat the composite alone. An applicant scoring above the median on 4 of the 5 weak signals (composite, CodeSignal, ToC, research-taste, AIS count) is ranked at a much higher rate than one scoring above the median on only 0–1. But the agreement-count AUC (~0.68) is only marginally better than that of the composite alone. D1 results.
Streams vary substantially in how closely they track the composite. Median ρ between composite and stream rank is ~0.20; some streams have ρ > 0.5 (closely tracking), others ρ < 0 (using orthogonal criteria). For low-ρ streams, per-attribute heatmaps reveal which attribute(s) the stream actually weights. D2 results.
Referee categories predict mentor evaluations only weakly, and most of that signal is already absorbed by centralized review scores. AI-safety-org refs show a modest positive lift on mentor composite, but the marginal value over centralized review scores is small: joint R² = 0.18 vs review-only R² = 0.18, i.e. no difference at two decimal places. D3 results.
Research-taste Part 1 and Part 2 are substantially redundant (Spearman ρ ≈ 0.64). Simplifying to a single-part test for 11.0 could save applicant time without losing much signal. D4 results.
Mentor evaluations are essentially a single "overall quality" factor. PC1 explains 60–70% of variance across cohorts; all four sub-dimensions load positively and evenly. Halo effect is real and consistent. Composite ≈ mean ≈ any individual dimension for predictive purposes. D5 results.
SRP/FRP × mentor-eval correlations are modest and weaken across cohorts. 7.0: ρ ≈ +0.23; 8.0: +0.11; 9.0: +0.06. The 9.0 FRP rubric is the most ambitious but the least mentor-aligned, possibly because it measures orthogonal program-quality dimensions. SRP/FRP is not a tight proxy for mentor signal. D6 results.
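
As an illustration of the D5 check above (how much variance PC1 explains, and whether all sub-dimensions load in the same direction), here is a minimal sketch. It is not the code used in D5. It assumes each mentor-eval sub-dimension is available as a plain list of numeric scores with no missing values and at least some variation, and it uses power iteration instead of a library eigensolver to stay dependency-free. The function names are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length score columns.

    Assumption: no missing values and every column has non-zero variance.
    """
    def standardize(col):
        n = len(col)
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / n)
        return [(v - mean) / sd for v in col]

    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def pc1_share(corr, iters=500):
    """Return (share of variance on PC1, PC1 loadings) for a correlation matrix.

    Power iteration converges to the leading eigenvector; the eigenvalue
    divided by the number of dimensions (the trace) is the variance share.
    The start vector is deliberately non-uniform; a start that happens to be
    orthogonal to PC1 would still fail, so treat this as a sketch only.
    """
    k = len(corr)
    v = [float(i + 1) for i in range(k)]
    norm = sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue / k, v
```

A share near 1/k (0.25 for four sub-dimensions) would indicate no dominant factor. A share in the 0.6–0.7 range with same-signed, similar-sized loadings is the single-factor pattern described above.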

11.0 implications (tentative)

Show Stage-3 reviewers a simple "weak-signal agreement count" as a secondary view alongside the composite (D1). Cheap signal that captures the convergent validity intuition.
Talk to low-consistency streams (D2) — find out whether they're picking up signal the composite misses or ranking idiosyncratically. Per-cluster scoring (per A6) is the obvious follow-up.
Simplify the research-taste test — one part instead of two — based on Part 1 / Part 2 redundancy (D4). Saves applicant time.
Don't try to design dimension-specific predictive rubrics for mentor evaluations (D5). The halo effect means a single composite is what you're predicting, regardless of which dimension you target.
Re-evaluate the FRP rubric (D6) — if it's measuring something orthogonal to mentor signal, decide whether that's intentional (it's measuring program impact, not researcher quality) or whether it's drifted unhelpfully.
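
The "weak-signal agreement count" proposed for Stage-3 reviewers (D1) could be computed along the following lines. This is a sketch under stated assumptions, not the D1 implementation: applicants are represented as dicts keyed by signal name, missing signals are stored as None or omitted, medians are taken over the applicants passed in, and the field and function names are hypothetical. The AUC helper uses the pairwise (Mann-Whitney) definition, which is adequate at the ~600-applicant scale.

```python
from statistics import median


def signal_medians(rows, signals):
    """Median of each signal over the rows that have a value for it."""
    return {
        s: median(r[s] for r in rows if r.get(s) is not None)
        for s in signals
    }


def agreement_count(row, medians, signals):
    """Number of signals on which this applicant is strictly above the median.

    Missing signals are skipped, so they never count as agreement.
    """
    return sum(
        1 for s in signals
        if row.get(s) is not None and row[s] > medians[s]
    )


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. labels are truthy for ranked applicants.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Running auc on the agreement counts and on the composite alone, against the same ranked/not-ranked labels, gives the side-by-side comparison reported for D1.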

Individual reports

Analysis	Question	n
D1 — Convergent validity	Does signal-agreement predict ranking?	~604 Stage-3 empirical
D2 — Stream consistency	How closely does each stream track the composite?	25 streams with ≥5 ranked
D3 — Reference quality (9.0)	Do referee categories predict mentor evals?	86 joined (9.0 only)
D4 — Work test score analysis	Sub-score redundancy? Tier consistency?	394 RT takers, 82 PG takers
D5 — Mentor eval factor structure	One factor or many?	76–106 mentor evals per cohort
D6 — SRP/FRP signal value	Do SRP/FRP scores correlate with mentor evals and publications?	~50–90 per cohort joined
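
Several of the analyses in the table (D2, D4, D6) rest on Spearman rank correlation. A minimal sketch of the per-stream version used for D2's question follows. It assumes rows of (stream, composite, stream_score) where a higher stream_score means the stream ranked the applicant better, and it applies the same "at least 5 ranked" inclusion rule as the table. The row layout and function names are hypothetical, and this is not the original analysis code.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rho_by_stream(rows, min_n=5):
    """Spearman rho between composite and stream score, per stream.

    Streams with fewer than min_n ranked applicants are dropped.
    """
    groups = defaultdict(list)
    for stream, composite, score in rows:
        groups[stream].append((composite, score))
    return {
        stream: spearman([c for c, _ in pairs], [s for _, s in pairs])
        for stream, pairs in groups.items()
        if len(pairs) >= min_n
    }
```

With only 5 to 10 applicants per stream, individual ρ values are noisy, so a low or negative ρ is better read as a prompt for a conversation with that stream than as a firm conclusion.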

Errors encountered during Part D

None unrecovered. One in-flight issue: D4's policy/gov "grading tier" field turned out to be a dict containing rationale text, not a clean tier label, so the analysis switched to comparing overall scores by ranked status instead.