B7 — Pangram / AI-detection

Context

Pangram is an AI-text-detection tool. For cohort 10.0, applicant free-text fields were run through Pangram and a fraction_ai score was stored per field (0 = human-written, 1 = high-confidence AI-generated, -1 = not analyzed / not enough text). We can use these scores to ask: how prevalent is AI-generated text in 10.0 applications? Does it correlate with worse Stage-2 scores? With lower P(ranked)? Or do reviewers seem to be (implicitly or explicitly) discounting it?

MATS (Machine Alignment, Transparency & Security) is an AI safety research fellowship that places ~120 fellows with ~100 mentors per cohort. Cohort 10.0 ran in summer 2026 and was the first cohort with a centralized application review instead of decentralized stream-specific review. This analysis is part of a broader effort to evaluate the 10.0 process and inform the design of 11.0 (autumn 2026).

How the 10.0 selection pipeline worked

The 10.0 pipeline in brief. ~2,200 people applied. Each applicant went through three stages:

Stage 1 — submitted background / experience / motivation, picked which research tracks they were interested in (Empirical, Policy & Strategy, Technical Governance, Theory, Compute Infrastructure). An LLM screen filtered out applicants who clearly didn't meet a minimum bar, and produced advisory per-stream recommendations.
Stage 2 — applicants who passed Stage 1 had their materials scored by LLM-graded rubrics. The empirical track used a composite score combining Research Skills, Technical Execution (split into MLE, SWE, Math sub-scores), and Soft Skills. The top ~600 by composite advanced to Stage 3.
Stage 3 — applicants chose specific mentors / "streams" to apply to. Each stream reviewed its applicants and produced a ranked list. Top-ranked applicants got offers; lower-ranked got waitlisted. ~120 offers were made.

For the empirical track, the composite formula is 0.50·RS + 0.35·TE + 0.15·SS, where TE = 0.50·MLE + 0.30·SWE + 0.20·Math. A "relevance multiplier" (Direct=1.0 / Adjacent=0.85 / Distant=0.60) is applied to Research Skills based on how the applicant's experience matches the streams they applied to.
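As a minimal sketch of the formula above: the weights and multipliers come from the text, while the function names, argument names, and the choice to scale RS before its 0.50 weight is applied are assumptions made for illustration.

```python
# Weights and multipliers are taken from the formula in the text.
# All function and argument names here are hypothetical.
RELEVANCE = {"Direct": 1.0, "Adjacent": 0.85, "Distant": 0.60}


def technical_execution(mle, swe, math_score):
    """TE = 0.50*MLE + 0.30*SWE + 0.20*Math."""
    return 0.50 * mle + 0.30 * swe + 0.20 * math_score


def empirical_composite(rs, mle, swe, math_score, ss, relevance="Direct"):
    """Composite = 0.50*RS + 0.35*TE + 0.15*SS.

    Assumption: the relevance multiplier scales Research Skills
    before the 0.50 weight is applied.
    """
    adjusted_rs = rs * RELEVANCE[relevance]
    te = technical_execution(mle, swe, math_score)
    return 0.50 * adjusted_rs + 0.35 * te + 0.15 * ss
```

With identical sub-scores, a "Direct" applicant keeps that score as the composite, while "Adjacent" or "Distant" relevance lowers only the Research Skills share.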

Outcome definitions

Outcome definitions used throughout these analyses:

is_ranked (primary outcome) — applicant was ranked by ≥1 stream. This is the cleanest signal of "the selection process picked this person." Not the same as "received an offer" — offer count is bounded by cohort size (~120), but rank count reflects quality independently of capacity.
is_invited_to_worktest (secondary outcome) — applicant was engaged by ≥1 stream in any way: invited to a work test, invited to an interview, ranked, or sent the Megastream takehome. Strict superset of is_ranked. One level above is_ranked in the funnel.
passed_mentors_bar — applicant was offered or waitlisted. In 10.0, this equals is_ranked exactly (every ranked person got either an offer or a waitlist slot).
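The three definitions above can be sketched as boolean flags derived from per-stream records. The record type and its field names are hypothetical; only the logic (ranked by at least one stream, engaged in any way, offered or waitlisted) follows the text.

```python
from dataclasses import dataclass


@dataclass
class StreamRecord:
    """One applicant-stream interaction. Field names are hypothetical."""
    ranked: bool = False
    worktest_invite: bool = False
    interview_invite: bool = False
    megastream_takehome: bool = False
    offered: bool = False
    waitlisted: bool = False


def outcome_flags(records):
    """Derive applicant-level outcomes from a list of StreamRecord."""
    is_ranked = any(r.ranked for r in records)
    # Being ranked counts as engagement, so this flag always includes is_ranked.
    is_invited_to_worktest = is_ranked or any(
        r.worktest_invite or r.interview_invite or r.megastream_takehome
        for r in records
    )
    passed_mentors_bar = any(r.offered or r.waitlisted for r in records)
    return {
        "is_ranked": is_ranked,
        "is_invited_to_worktest": is_invited_to_worktest,
        "passed_mentors_bar": passed_mentors_bar,
    }
```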

What Pangram scored

The ToC reasoning text (highest coverage: 1,977 applicants analyzed)
Long-form questions, options A and B, for the Empirical, Policy & Strategy, and Technical Governance tracks
Writing samples 1 and 2 (for applicants who submitted them)

For each applicant we compute max fraction_ai across all analyzed fields — i.e., the highest AI-detection score they hit on any single field. -1 (not analyzed) is treated as missing.
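A minimal sketch of this per-applicant aggregation, assuming the field scores arrive as a plain dictionary (the field keys are hypothetical). The -1 sentinel is dropped before taking the maximum, so an applicant with no analyzed field ends up with a missing value rather than -1.

```python
def max_fraction_ai(field_scores):
    """Highest fraction_ai across analyzed fields, or None if nothing was analyzed.

    field_scores maps a field name to its fraction_ai score; -1 (not analyzed)
    and None are both treated as missing.
    """
    analyzed = [s for s in field_scores.values() if s is not None and s >= 0]
    return max(analyzed) if analyzed else None
```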

Caveat — interpretation

Pangram is a model. A 'high fraction_ai' score is not a confession; it's an estimate. False positives exist, especially for non-native-English writers who use ChatGPT or Grammarly for editing rather than generation. We don't treat fraction_ai as ground truth — just as an applicant-level signal.

Headline

Of 2,104 applicants with any Pangram-analyzed text, 772 (37%) had at least one field flagged as AI-generated (max fraction_ai ≥ 0.9).
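The headline share can be computed along these lines. This is a sketch with a hypothetical function name; the 0.9 threshold and the exclusion of applicants without analyzed text follow the text.

```python
def flag_rate(max_scores, threshold=0.9):
    """Return (n_flagged, n_analyzed, share) for per-applicant max fraction_ai.

    None marks an applicant with no analyzed text and is excluded
    from the denominator.
    """
    analyzed = [s for s in max_scores if s is not None]
    flagged = sum(1 for s in analyzed if s >= threshold)
    share = flagged / len(analyzed) if analyzed else None
    return flagged, len(analyzed), share
```

As a check on the reported figure, 772 / 2,104 is roughly 0.367, which rounds to the 37% quoted above.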

Higher Pangram scores correlate with worse outcomes, but modestly:
- Max fraction_ai → is_ranked (full pool, n=2,104): AUC = 0.574 [0.539, 0.607].
- Max fraction_ai → composite score (Spearman): ρ = -0.090.
- In the Stage-3 empirical pool (n=765): AUC = 0.552 [0.510, 0.591]; Spearman ρ with composite = -0.028.

The Stage-3 effect is smaller — most of the Pangram-related signal gets absorbed earlier in the pipeline.
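For readers who want to reproduce statistics of this kind, here is a stdlib-only sketch of a rank-based AUC, a percentile bootstrap interval, and Spearman's ρ. The source does not say how the intervals were computed or which class was treated as positive, so the bootstrap method and the label orientation (1 = the class expected to score higher) are assumptions, as are all function names.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc(scores, labels):
    """P(random positive outscores random negative), ties counted as 0.5."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # undefined when only one class is present
    ranks = average_ranks(scores)
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = auc([scores[i] for i in idx], [labels[i] for i in idx])
        if value is not None:  # skip resamples containing a single class
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

An AUC of 0.5 means the score carries no ranking information, so values such as 0.574 and 0.552 indicate a weak association.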

Per-field Pangram coverage

Field	n analyzed	Mean fraction_ai	% with ≥0.9
Reasoning for top 3 (fraction_ai)	1,977	0.22	21.8%
Writing sample 1 (fraction_ai)	378	0.16	7.1%
Writing sample 2 (fraction_ai)	307	0.19	11.7%
Empirical option A (fraction_ai)	516	0.38	38.0%
Empirical option B (fraction_ai)	1,072	0.33	33.4%
Policy & Strategy option A (fraction_ai)	116	0.42	42.2%
Policy & Strategy option B (fraction_ai)	291	0.44	44.0%
Technical Governance option A (fraction_ai)	111	0.51	51.4%
Technical Governance option B (fraction_ai)	236	0.52	50.8%

ToC reasoning Pangram bucket → outcomes

The ToC reasoning text has the broadest coverage (1,977 applicants analyzed). Scores were bucketed into 4 fraction_ai bands; the table shows three rows, whose counts sum to the full 1,977.

Bucket	n	P(ranked)	Mean composite
0 (human)	1,546	9.7%	1.69
low (0.1–0.5)	1	0.0%	0.00
high (≥0.9)	430	5.1%	1.47

Distribution of max fraction_ai

The distribution is heavily bimodal: many applicants score 0 (clean human writing on every field) or 1 (high AI detection somewhere).
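The ToC reasoning bucket table could be produced along these lines. The bucket labels for 0, low, and high come from that table. The fourth band is not named in the text, so the "mid (assumed)" label, the handling of scores between 0 and 0.1, and the inclusive upper edge at 0.5 are assumptions.

```python
def bucket(score):
    """Assign a ToC reasoning fraction_ai score to a band."""
    if score == 0:
        return "0 (human)"
    if score >= 0.9:
        return "high (≥0.9)"
    if score <= 0.5:
        # Assumption: anything above 0 and up to 0.5 counts as "low".
        return "low (0.1–0.5)"
    return "mid (assumed)"  # hypothetical label for the unnamed fourth band


def summarize(rows):
    """rows: iterable of (fraction_ai, is_ranked, composite) per applicant.

    Returns {bucket: (n, P(ranked), mean composite)}.
    """
    grouped = {}
    for score, is_ranked, composite in rows:
        grouped.setdefault(bucket(score), []).append((is_ranked, composite))
    summary = {}
    for name, members in grouped.items():
        n = len(members)
        p_ranked = sum(1 for ranked, _ in members if ranked) / n
        mean_composite = sum(c for _, c in members) / n
        summary[name] = (n, p_ranked, mean_composite)
    return summary
```

A band holding a single applicant, like the "low" row in that table, yields a P(ranked) of exactly 0% or 100%, so it should not be read as an estimate.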

Takeaways