I saw an article a couple of weeks ago arguing that LLM evals are too often binary: “did it complete the task or did it not?”. This is a known problem. A recent paper, “Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks” (Chandwani & Gupta, March 2026), notes that existing benchmarks like WebArena, SWE-bench, AgentBench, and GAIA predominantly use binary success metrics, even for complex multi-step tasks. Another paper, “Moving LLM evaluation forward: lessons from human judgment research” (Polonioli, 2025), argues for moving beyond binary success/failure judgments toward more nuanced, context-sensitive evaluation, drawing from cognitive-science work on bounded rationality.
So when a model goes from almost completing a task to fully completing it, the score jumps from zero to one. On SWE-bench, a patch that fixes seven of eight failing tests scores exactly the same as one that fixes none: zero. The improvement may have been gradual, but the metric makes it look sudden. And sudden jumps make better headlines, better leaderboards, and better funding pitches.
This reminds me of one of the criticisms of Eldredge and Gould’s theory of punctuated equilibria that was going around when I was at university. If we rely on the fossil record to establish that long periods of stasis are followed by very rapid speciation events, we might introduce bias.
I remember from palaeontology lectures the logic applied to bivalves and brachiopods. The argument went: if their evolution and speciation are evaluated based on visible changes in the hinges, the ridging and the “teeth” where the valves interlock, this very likely ignores the possibility of very gradual changes happening over time in the soft tissue, the tendons and muscles that kept the two valves together. A change in our metric for evaluating this might have completely changed our theory about a specific phylum: was it anagenesis or cladogenesis? But soft tissues are not preserved in the vast majority of cases. Hence the potential bias.
The fossil record doesn’t choose to discard soft tissue; it’s a consequence of preservation physics. Similarly, binary benchmarks don’t intend to hide gradual progress; it’s a consequence of how the scoring function was designed. But in both cases, the bias in the record gets reified into a theory about the underlying process.
Here’s where the analogy bites hardest: for bivalves or brachiopods, we genuinely cannot recover the soft tissue; the information is destroyed. For LLMs, we could instrument graded metrics but mostly haven’t bothered. We’re imposing a taphonomic bias on ourselves, by choice.
Why? Binary metrics produce cleaner narratives for press releases, leaderboard drama, and funding pitches. “We jumped 12 points on SWE-bench” is a story. “Our graded partial-completion score improved 3% across 47 subtask categories” is not. The incentive structure actively selects against building the soft-tissue instrumentation, even though we could.
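To make the contrast concrete, here is a minimal sketch of the two scoring functions applied to the same task. The sequence of model versions and their pass counts are hypothetical, chosen only to echo the seven-of-eight example above; real harnesses define pass/fail in their own ways.

```python
def binary_score(tests_passed: int, tests_total: int) -> int:
    """All-or-nothing: 1 only if every test passes, otherwise 0."""
    return 1 if tests_passed == tests_total else 0


def graded_score(tests_passed: int, tests_total: int) -> float:
    """Partial credit: the fraction of tests that pass."""
    return tests_passed / tests_total


# Hypothetical: five successive model versions on one task with 8 failing tests.
TESTS_TOTAL = 8
tests_passed_by_version = [0, 2, 5, 7, 8]

binary_curve = [binary_score(p, TESTS_TOTAL) for p in tests_passed_by_version]
graded_curve = [graded_score(p, TESTS_TOTAL) for p in tests_passed_by_version]

print(binary_curve)  # [0, 0, 0, 0, 1]
print(graded_curve)  # [0.0, 0.25, 0.625, 0.875, 1.0]
```

Same underlying progress, two different stories: the binary curve shows stasis followed by a sudden jump, while the graded curve shows a steady climb.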
Claudogenesis. Original sketch by the author (with lots of artistic and palaeontological licence).
A shorter version of this post appeared on LinkedIn on March 25, 2026. This version includes additional prose, figures, and a postscript on a conversation it sparked.
In a previous LinkedIn article, I pointed my BS detector at AI news. This time I’m pointing it at my own AI.
The numbers turned out not to be directly comparable — different measurement methods — but the question still stood. Whatever the exact percentages, they’re averages across the whole genome. And averages hide a lot.
That led me into incomplete lineage sorting (ILS): the fact that if you line up human, chimp, and gorilla DNA and compare it piece by piece, about 30% of the genome tells a different evolutionary story than the species tree (Scally et al. 2012). Well-established science. Notoriously hard to explain.
I asked Claude: “so the genome and speciation diagrams do not overlap?” It responded with an analogy I hadn’t asked for — dealing cards from a deck.
Seven gene variants in the ancestral population. Gorillas split off first — first deal. Some cards go to gorilla, the human-chimp ancestor keeps others, some go to both. Second deal splits human from chimp. Three players, overlapping hands. Compare: Card D went to human and chimp but not gorilla — species tree. Card C went to human and gorilla, skipping chimp — contradicts it. Card E went to chimp and gorilla, skipping human.
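The comparison step can be sketched in a few lines. This is only an illustration: the hands below contain just the three cards described above (C, D, E), since the distribution of the other four is not spelled out here, and the names `hands`, `TOPOLOGY`, and `classify` are made up for the sketch.

```python
# Only the three cards described in the text; the other four of the
# seven are not specified, so they are left out of this sketch.
hands = {
    "human": {"C", "D"},
    "chimp": {"D", "E"},
    "gorilla": {"C", "E"},
}

# Which pair of players shares a card decides the story that card tells.
TOPOLOGY = {
    frozenset({"human", "chimp"}): "human-chimp (matches the species tree)",
    frozenset({"human", "gorilla"}): "human-gorilla (contradicts it)",
    frozenset({"chimp", "gorilla"}): "chimp-gorilla (contradicts it)",
}


def classify(card: str, hands: dict[str, set[str]]) -> str:
    """Return the grouping implied by the set of players holding a card."""
    holders = frozenset(p for p, cards in hands.items() if card in cards)
    # Cards held by one player or by all three say nothing about grouping.
    return TOPOLOGY.get(holders, "uninformative")


all_cards = sorted(set().union(*hands.values()))
result = {card: classify(card, hands) for card in all_cards}

for card, story in result.items():
    print(card, "->", story)
```

Each card stands in for one stretch of genome, and the pair of players sharing it is the tree that stretch supports.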
I said “make me a diagram.” No specs. Claude produced the figure below.
Two cosmetic tweaks from me afterward. Everything else — concept, layout, card naming — is Claude’s.
So: is this actually original?
That’s a claim worth auditing. I ran it through the same framework I built for the Pentagon/ChatGPT post.
Step 1: Search. I asked Claude to search for prior card-dealing analogies for ILS. It found analogies using M&Ms (coloured candies sorted into jars — Avian Hybrids, 2022) and Pachinko machines (marbles through pegs — The G-cat, 2021). No cards. Those analogies explain random sorting of identical items into bins. The card version does something different: distinct identities per variant, two sequential deals, and a colour-coded punchline mapping to the three topologies.
Step 2: Audit the search itself. An LLM claiming “I didn’t find it” is not the same as “it doesn’t exist.” So I ran a Fermi sanity check on the search coverage. ILS is a niche topic — maybe 50 published explainers total, of which maybe 6 use any analogy at all. The keywords “card,” “deck,” “dealt” are highly distinctive in evolutionary biology. If a card analogy existed in any indexed source, five independent searches would almost certainly surface it. Probability of missing it in searchable literature: ~3%.
Step 3: Check the limits. Web search doesn’t reach textbook interiors, lecture slides, or classroom analogies. Someone may have used cards to explain ILS on a whiteboard in 2004. I can’t rule that out, and I shouldn’t claim to.
Step 4: Independent verification. I ran the claim through a separate Claude instance with extended thinking, using the full bullshit-detector framework — source verification, Fermi sanity check on search coverage, logical fallacy scan. Verdict: claim passes for searchable literature. The key flag: watch for equivocation on “published.” Web-indexed is not the same as “ever conceived.”
Verdict: “No published precedent found in searchable literature” is defensible. “First ever” is not. That distinction matters — it’s the same denominator hygiene from the Pentagon post. Know what your evidence covers and don’t claim more.
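The Fermi check in Step 2 can be written out as arithmetic. This is a sketch under an assumption not stated above: that each of the five searches independently has the same chance of surfacing an existing card analogy. The function names are illustrative.

```python
def miss_probability(per_search_hit: float, n_searches: int) -> float:
    """Chance that every one of n independent searches misses the item."""
    return (1.0 - per_search_hit) ** n_searches


def implied_hit_rate(total_miss: float, n_searches: int) -> float:
    """Per-search hit rate needed to reach a given overall miss probability."""
    return 1.0 - total_miss ** (1.0 / n_searches)


N_SEARCHES = 5  # from the text: five independent searches

# Assumed per-search hit rate of 50%: the overall miss chance is 0.5 ** 5.
print(miss_probability(0.5, N_SEARCHES))  # 0.03125

# Working backwards from the ~3% figure quoted in Step 2.
print(round(implied_hit_rate(0.03, N_SEARCHES), 3))
```

Read this way, the ~3% figure only requires each search to be about a coin flip, which is modest for keywords as distinctive as "card" and "deck". The caveat is the independence assumption: searches run through the same engine are probably correlated, so the true miss probability could be higher.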
There’s a sharper version of the pattern-matching hypothesis worth naming. M&Ms were almost certainly in Claude’s training data. The move from “identical items sorted randomly into bins” to “distinct cards dealt sequentially to named players” is exactly the kind of transformation described by the line usually attributed to Picasso, “great artists steal”. The idea is almost certainly T.S. Eliot’s, who said it first, said it better, and meant something more precise: that the good poet welds the theft into something utterly different from the source. If that’s what happened here, the output is novel but the mechanism isn’t a jump; it’s a steal from a prior analogy in the same domain. I can’t rule it out. The audit covers what’s published, not what’s in the weights.
The direct precedent in the research literature is Figure 5 of Rivas-González et al. (2024) — topology posterior probabilities along a single chromosome. Rigorous work. Also very hard to read if you don’t already know population genetics.
Figure 5B from Rivas-González et al. 2024, PLOS Genetics 20(2):e1010836, CC BY 4.0. A segment of chromosome 1 showing ILS levels and coalescent depths across genomic windows. Three tracks, two colour scales, one chromosome. And this is the simplified view!
I also tried building a simplified whole-genome chromosome painting — same colour scheme as the card, synthetic data matched to published proportions, multiple sorting strategies — but it still required considerable mental effort to read. Far from the card analogy.
Illustrative chromosome painting of incomplete lineage sorting across all 23 human chromosomes (1–22 plus X). Each horizontal bar is one chromosome, divided into 100 kb windows and colour-coded by which phylogenetic topology wins in that region: gold for human–chimp (the species tree, ~63%), green for human–gorilla (~18.5%), purple for chimp–gorilla (~18.5%). Generated with synthetic data matched to published genome-wide proportions; spatial clustering is representative, not derived from specific genomic coordinates. Compare with the card figure above: the same information, but at chromosome scale the signal dissolves into noise. The card analogy works because it operates at the level of mechanism, not data.
The first thing out of the conversation turned out to be the best thing. And the honest audit of the originality claim turned out to be more interesting than just asserting it.
What’s your take — does this count as genuine originality, or is it sophisticated pattern-matching that happens to land on something no one published before? There’s a broader debate running on this — whether LLMs are structurally capable of abduction, the kind of jump Einstein described from sensory experience to novel axioms, or whether they’re confined to induction and deduction no matter how fluent they look. I don’t have a settled answer. I’d be curious where you draw the line.
Postscript
On the same day I posted this on LinkedIn, I came across a post by Tom Chatfield — author and thinker on technology and language — about the “no true Scotsman” fallacy in discussions of AI creativity. The pattern he described: machines produce striking outputs, people dismiss them with “it can’t be genuinely creative because a machine made it.” The dismissal protects the category by rejecting inconvenient evidence rather than examining it.
The timing was coincidental. The overlap was too good to ignore, so I commented with the card analogy as a concrete test case. Chatfield’s reply singled out the phrase “dismissing the output to protect the category” and said his instinct was to investigate rather than gatekeep — to ask what kinds of creativity are at work and how they intersect with human learning.
That instinct is where the interesting conversation lives. The card analogy is a good test case because the output is concrete enough to examine: what did the model actually do, what’s novel about it, what isn’t? Much more productive than arguing about whether to award the word “creative.”
And does it matter?
The AlphaGo comparison is worth raising. Move 37 in the second game against Lee Sedol is widely accepted as a creative act — it violated every human prior and it won. The creativity claim has a ground truth. LLM outputs don’t. The card analogy can’t be verified against a scoreboard, which leaves room for dismissal that Move 37 never faced. But that’s a different argument from “it can’t be creative because a machine made it.” Worth keeping the two separate.
References: Scally et al. 2012, Nature 483:169; Rivas-González et al. 2023, Science 380:eabn4409; Rivas-González et al. 2024, PLOS Genetics 20(2):e1010836. M&M analogy: Avian Hybrids blog, 2022 (avianhybrids.wordpress.com). Pachinko analogy: The G-cat blog, 2021 (theg-cat.com).
Card analogy and figure: Claude (Opus 4.6), unprompted during conversation. Provenance verified against conversation transcript with line numbers. Originality audited with bullshit-detector framework (pip install bullshit-detector).