LLM Evaluation’s Taphonomic Bias

I saw an article a couple of weeks ago arguing that LLM evals are too often binary: “did it complete the task or did it not?”. This is a known problem. A recent paper, “Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks” (Chandwani & Gupta, March 2026), notes that existing benchmarks like WebArena, SWE-bench, AgentBench, and GAIA predominantly use binary success metrics, even for complex multi-step tasks. Another paper, “Moving LLM evaluation forward: lessons from human judgment research” (Polonioli, 2025), argues for moving beyond binary success/failure judgments toward more nuanced, context-sensitive evaluation, drawing from cognitive-science work on bounded rationality.

So when a model goes from almost completing a task to fully completing it, the score jumps from zero to one. On SWE-bench, a patch that fixes seven of eight failing tests scores exactly the same as one that fixes none: zero. The improvement may have been gradual, but the metric makes it look sudden. And sudden jumps make better headlines, better leaderboards, and better funding pitches.
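A minimal sketch of that difference, assuming a made-up task with eight failing tests. This is not SWE-bench's actual harness: `PatchResult`, `binary_score`, `graded_score` and the checkpoint numbers are all hypothetical, and "fraction of failing tests fixed" is only one of many possible partial-credit scores.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Outcome of one candidate patch on a task (hypothetical structure)."""
    tests_fixed: int
    tests_total: int


def binary_score(result: PatchResult) -> int:
    """All-or-nothing: 1 only if every failing test now passes."""
    return 1 if result.tests_fixed == result.tests_total else 0


def graded_score(result: PatchResult) -> float:
    """Assumed partial-credit score: fraction of failing tests fixed."""
    return result.tests_fixed / result.tests_total


# Invented sequence of model versions improving gradually on one task.
checkpoints = [PatchResult(n, 8) for n in (0, 3, 5, 7, 8)]

binary = [binary_score(r) for r in checkpoints]
graded = [graded_score(r) for r in checkpoints]

print(binary)  # [0, 0, 0, 0, 1]
print(graded)  # [0.0, 0.375, 0.625, 0.875, 1.0]
```

Same underlying trajectory, two records of it: the graded series climbs steadily, while the binary series sits flat and then jumps at the last step.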

This reminds me of one of the criticisms of Eldredge and Gould’s theory of punctuated equilibria that was going around when I was at university. If we rely on the fossil record to establish that long periods of stasis are interrupted by very rapid speciation events, we might be introducing a bias.

I remember from palaeontology lectures the logic applied to bivalves and brachiopods. The argument went: if their evolution and speciation are evaluated based on visible changes in the hinges, the ridging and the “teeth” where the valves interlock, this very likely ignores the possibility of very gradual changes happening over time in the soft tissue, the tendons and muscles that keep the two valves together. A different metric might have completely changed our theory about a specific phylum: was it anagenesis or cladogenesis? But soft tissues are not preserved in the vast majority of cases. Hence the potential bias.

The fossil record doesn’t choose to discard soft tissue; it’s a consequence of preservation physics. Similarly, binary benchmarks don’t intend to hide gradual progress; it’s a consequence of how the scoring function was designed. But in both cases, the bias in the record gets reified into a theory about the underlying process.

Here’s where the analogy bites hardest: for bivalves or brachiopods, we genuinely cannot recover the soft tissue; the information is destroyed. For LLMs, we could instrument graded metrics but mostly haven’t bothered. We’re imposing a taphonomic bias on ourselves, by choice.

Why? Binary metrics produce cleaner narratives for press releases, leaderboard drama, and funding pitches. “We jumped 12 points on SWE-bench” is a story. “Our graded partial-completion score improved 3% across 47 subtask categories” is not. The incentive structure actively selects against building the soft-tissue instrumentation, even though we could.

Claudogenesis. Original sketch by the author (with lots of artistic and palaeontological license).

Citations: Chandwani, A. & Gupta, I. (2026). arxiv.org/abs/2603.22744 – Polonioli, A. (2025). pmc.ncbi.nlm.nih.gov/articles/PMC12149859/