Part D — Convergent & exploratory analyses

Context

Parts A, B, and C dug into the 10.0 process directly and validated findings across cohorts. Part D is more exploratory. Its six analyses cover convergent validity, within-stream consistency, factor structure, and the relationship between intermediate quality signals (SRP/FRP) and downstream outcomes.

Headline findings

  1. Signal agreement is real but does not dramatically beat the composite alone. An applicant who scores above the median on 4 of 5 weak signals (composite, CodeSignal, ToC, research-taste, AIS count) is ranked at a much higher rate than one who is above the median on only 0–1. But the agreement-count AUC (~0.68) is only marginally better than that of the composite alone. D1 results.
  2. Streams vary substantially in how closely they track the composite. Median ρ between composite and stream rank is ~0.20; some streams have ρ > 0.5 (closely tracking), others ρ < 0 (using orthogonal criteria). For low-ρ streams, per-attribute heatmaps reveal which attribute(s) the stream actually weights. D2 results.
  3. Referee categories predict mentor evaluations only weakly, and most of that signal is absorbed by centralized review scores. References from AI-safety orgs show a modest positive lift on the mentor composite, but their marginal value over centralized review scores is small: joint R² = 0.18 vs review-only R² = 0.18, identical to two decimals. D3 results.
  4. Research-taste Part 1 and Part 2 are substantially redundant (Spearman ρ ≈ 0.64). Simplifying to a single-part test for 11.0 could save applicant time without losing much signal. D4 results.
  5. Mentor evaluations are essentially a single "overall quality" factor. PC1 explains 60–70% of variance across cohorts; all four sub-dimensions load positively and evenly. Halo effect is real and consistent. Composite ≈ mean ≈ any individual dimension for predictive purposes. D5 results.
  6. SRP/FRP × mentor-eval correlations are modest and weaken across cohorts. 7.0: ρ ≈ +0.23; 8.0: +0.11; 9.0: +0.06. The 9.0 FRP rubric is the most ambitious but the least mentor-aligned, possibly because it measures orthogonal program-quality dimensions. SRP/FRP is not a tight proxy for mentor signal. D6 results.
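
As an illustration of the agreement-count idea in finding 1, the sketch below counts how many signals each applicant is above the median on and scores that count with a rank-based AUC. It is a minimal stdlib-only sketch, not the code behind D1: the function names, the signal keys, and the toy numbers are all hypothetical, and "above-median" is assumed to mean strictly greater than the per-signal median.

```python
from statistics import median


def agreement_counts(signals):
    """Count, per applicant, the signals that are strictly above that signal's median.

    signals: dict mapping a signal name to a list of scores, one per applicant,
    all lists in the same applicant order (an assumed layout).
    """
    n = len(next(iter(signals.values())))
    counts = [0] * n
    for values in signals.values():
        cutoff = median(values)
        for i, v in enumerate(values):
            if v > cutoff:
                counts[i] += 1
    return counts


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration (three signals, six applicants).
example_signals = {
    "composite": [0.2, 0.5, 0.9, 0.7, 0.1, 0.6],
    "codesignal": [300, 410, 480, 390, 350, 450],
    "toc": [2, 3, 5, 4, 1, 2],
}
example_ranked = [0, 0, 1, 1, 0, 1]  # 1 = applicant was ranked

example_counts = agreement_counts(example_signals)
agreement_auc = auc(example_counts, example_ranked)
composite_auc = auc(example_signals["composite"], example_ranked)
```

Comparing `agreement_auc` with `composite_auc` on the same applicants is the comparison the finding describes. The pairwise AUC is quadratic in the number of applicants, which is fine at the scale of a few hundred rows.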

11.0 implications (tentative)

Individual reports

Analysis | Question | n
D1: Convergent validity | Does signal-agreement predict ranking? | ~604 Stage-3 empirical
D2: Stream consistency | How closely does each stream track the composite? | 25 streams with ≥5 ranked
D3: Reference quality (9.0) | Do referee categories predict mentor evals? | 86 joined (9.0 only)
D4: Work test score analysis | Sub-score redundancy? Tier consistency? | 394 RT takers, 82 PG takers
D5: Mentor eval factor structure | One factor or many? | 76–106 mentor evals per cohort
D6: SRP/FRP signal value | Do SRP/FRP scores correlate with mentor evals and publications? | ~50–90 per cohort joined
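
For the D2 question of how closely each stream tracks the composite, a per-stream Spearman ρ with a minimum-size filter can be sketched as follows. This is a stdlib-only illustration under assumptions: the row layout `(stream, composite, stream_score)` and the function names are hypothetical, and `stream_score` is assumed to be oriented so that higher means more preferred (a rank position with 1 = best would need to be negated first).

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when either side is constant
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def stream_rhos(rows, min_ranked=5):
    """Return {stream: rho} for streams with at least min_ranked applicants."""
    by_stream = defaultdict(list)
    for stream, composite, stream_score in rows:
        by_stream[stream].append((composite, stream_score))
    rhos = {}
    for stream, pairs in by_stream.items():
        if len(pairs) < min_ranked:
            continue  # too few ranked applicants for a meaningful rho
        composites, scores = zip(*pairs)
        rhos[stream] = spearman(composites, scores)
    return rhos
```

The `min_ranked=5` default mirrors the "≥5 ranked" threshold in the D2 row. With so few points per stream, individual ρ values are noisy, so the median across streams is the more stable summary.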

Errors encountered during Part D

None unrecovered. One was handled in flight: D4's policy/gov "grading tier" field turned out to be a dict containing rationale text rather than a clean tier label, so the analysis switched to comparing overall scores by ranked status instead.
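
A surprise like this one can be caught before analysis by tallying the shape of a field across records. The sketch below is one hedged way to do that: the field name `grading_tier`, the record layout (a list of dicts), and both function names are hypothetical, not the actual D4 schema.

```python
from collections import Counter


def field_shape(value):
    """Classify a raw field value so unexpected structures show up early."""
    if value is None:
        return "missing"
    if isinstance(value, str):
        return "label" if value.strip() else "missing"
    if isinstance(value, dict):
        return "dict"
    return type(value).__name__


def shape_summary(records, field):
    """Tally the shapes of one field across records (each record is a dict)."""
    return Counter(field_shape(r.get(field)) for r in records)


# Toy records, invented for illustration only.
example_records = [
    {"grading_tier": {"rationale": "free-text explanation"}},
    {"grading_tier": "A"},
    {},
]
example_summary = shape_summary(example_records, "grading_tier")
```

If the summary shows mostly `dict` entries where labels were expected, falling back to a different comparison, as was done here, is the safer choice than trying to parse a tier out of free text.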