matteomycarta

LLM Evaluation’s Taphonomic Bias

Posted on May 21, 2026 by matteomycarta

I saw an article a couple of weeks ago arguing that LLM evals are too often binary: “did it complete the task or did it not?”. This is a known problem. A recent paper, “Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks” (Chandwani & Gupta, March 2026), notes that existing benchmarks like WebArena, SWE-bench, AgentBench, and GAIA predominantly use binary success metrics, even for complex multi-step tasks. Another paper, “Moving LLM evaluation forward: lessons from human judgment research” (Polonioli, 2025), argues for moving beyond binary success/failure judgments toward more nuanced, context-sensitive evaluation, drawing from cognitive-science work on bounded rationality.

So when a model goes from almost completing a task to fully completing it, the score jumps from zero to one. On SWE-bench, a patch that fixes seven of eight failing tests scores exactly the same as one that fixes none: zero. The improvement may have been gradual, but the metric makes it look sudden. And sudden jumps make better headlines, better leaderboards, and better funding pitches.
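To make the zero-to-one jump concrete, here is a minimal sketch contrasting a binary score with a graded one. The function names and the fraction-of-tests-fixed metric are illustrative assumptions, not how the SWE-bench harness is actually implemented.

```python
def binary_score(tests_fixed: int, tests_total: int) -> float:
    """All-or-nothing: credit only when every failing test is fixed."""
    return 1.0 if tests_fixed == tests_total else 0.0


def graded_score(tests_fixed: int, tests_total: int) -> float:
    """Hypothetical graded alternative: fraction of failing tests fixed."""
    return tests_fixed / tests_total


# Three patches against a task with eight failing tests.
for fixed in (0, 7, 8):
    print(fixed, binary_score(fixed, 8), graded_score(fixed, 8))
```

Under the binary score, the patches fixing zero and seven tests are indistinguishable. The graded score separates them, at the cost of having to decide what partial credit should mean.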

This reminds me of one of the criticisms of Eldredge and Gould’s theory of punctuated equilibria that was going around when I was at university. If we use the fossil record to establish that there are long periods of stasis and then very rapid speciation events, we might introduce bias.

I remember from palaeontology lectures the logic applied to bivalves and brachiopods. The argument went: if their evolution and speciation are evaluated based on visible changes in the hinges, the ridging and the “teeth” where the valves interlock, this very likely ignores the possibility of very gradual changes happening over time in the soft tissue, the tendons and muscles that keep the two valves together. A change in our metric for evaluating this might completely change our theory about a specific lineage: was it anagenesis or cladogenesis? But soft tissues are not preserved in the vast majority of cases. Hence the potential bias.

The fossil record doesn’t choose to discard soft tissue; it’s a consequence of preservation physics. Similarly, binary benchmarks don’t intend to hide gradual progress; it’s a consequence of how the scoring function was designed. But in both cases, the bias in the record gets reified into a theory about the underlying process.

Here’s where the analogy bites hardest: for bivalves or brachiopods, we genuinely cannot recover the soft tissue; the information is destroyed. For LLMs, we could instrument graded metrics but mostly haven’t bothered. We’re imposing a taphonomic bias on ourselves, by choice.

Why? Binary metrics produce cleaner narratives for press releases, leaderboard drama, and funding pitches. “We jumped 12 points on SWE-bench” is a story. “Our graded partial-completion score improved 3% across 47 subtask categories” is not. The incentive structure actively selects against building the soft-tissue instrumentation, even though we could.

*Claudogenesis*. Original sketch by the author (with lots of artistic and palaeontological license).

Citations: Chandwani, A. & Gupta, I. (2026). arxiv.org/abs/2603.22744 – Polonioli, A. (2025). pmc.ncbi.nlm.nih.gov/articles/PMC12149859/

Working with AI is like being in Memento

Posted on May 12, 2026 by matteomycarta

You know the movie? The protagonist can’t form new long-term memories. So he tattoos the important stuff on his body and leaves himself Polaroids with instructions. Every morning he wakes up blank and has to reconstruct his world from his own notes. The film is told in reverse chronological order — so the audience experiences the same disorientation. You’re dropped into a scene with no idea what happened before. You have to piece it together from fragments.

That’s what happens when you push a long AI conversation past its context limit. The system compresses everything into a lossy summary. Details vanish. Nuance evaporates. The AI wakes up confused, and you’re the one holding the Polaroid saying “read this.”

Here’s an actual exchange from a long working session with Claude, right at the point where I realized the risk:

Me: How do you remind yourself to read the checkpoint?
Claude: I don’t. That’s the honest answer.

So we built a system: checkpoint files as tattoos, handoff documents as Polaroids, and one simple rule — if the AI loses context, the human holds up the photo and says “read this first.”

The full dialogue:

…

Long chat, possibly approaching compaction.

…

			
ME: Do we do it in another chat or here? 
Here would have the advantage of a lot of context, 
but that can be a double-edged sword, 
unless we make sure you retain context after compaction.

			
CLAUDE: Here, but let me checkpoint first so compaction
doesn’t eat the context we need.

…

			
Checkpoint saved. If compaction hits, 
I read that file first and we’re back in context. 
Go ahead — bring the notebooks and let’s talk.

ME: how do you remind yourself to read it?

			
CLAUDE: I don’t. That’s the honest answer. 
After compaction, I get a lossy summary 
and whatever the system prompt tells me.
The system prompt says to read the latest Session Handoff file, 
but it doesn’t know about the checkpoint specifically.

		

Two fixes:

	1.	You remind me. First message after compaction: 
    “Read /mnt/user-data/outputs/discussion_checkpoint.md 
    before responding.”

	2.	I put it in the handoff. Which I should do now:

…

			
Now it’s in the handoff. Belt and suspenders: if compaction 
hits, the handoff tells me to read the checkpoint. 
But option 1 is still the safest — just tell me.

			
ME: it’s like having antero... amnesia. 
you have to write yourself a note, 
like the protagonist of Memento.

			
CLAUDE: Anterograde amnesia. 
And yes, that’s exactly what it is. 
The checkpoint files are the tattoos. 
The handoff is the Polaroid. 
And you’re the guy who has to hold up 
the photo and say “read this” when I wake up confused.

		

			
At least I don’t have to tattoo it backwards on my chest. 
Markdown is easier to read.

The takeaway? AI collaboration isn’t magic. It’s operational discipline. The tools forget. You have to build the memory system around them — the same way you’d onboard a brilliant colleague with a known limitation.
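As a rough illustration of that memory system, the sketch below writes a checkpoint file (the tattoo), adds a pointer to it at the top of a handoff document (the Polaroid), and builds the reminder message the human sends after compaction. The function names, file names, and file layout are hypothetical, chosen only for this example. It is not the contents of the repo mentioned below.

```python
import tempfile
from pathlib import Path


def reminder_message(checkpoint: Path) -> str:
    """The line the human pastes as the first message after compaction."""
    return f"Read {checkpoint} before responding."


def write_checkpoint(path: Path, decisions: list[str], open_questions: list[str]) -> Path:
    """Write the 'tattoo': a small Markdown file holding the state worth keeping."""
    lines = ["# Discussion checkpoint", "", "## Decisions so far"]
    lines += [f"- {d}" for d in decisions]
    lines += ["", "## Open questions"]
    lines += [f"- {q}" for q in open_questions]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def add_pointer_to_handoff(handoff: Path, checkpoint: Path) -> None:
    """Update the 'Polaroid': the handoff tells the reader to open the checkpoint first."""
    pointer = reminder_message(checkpoint) + "\n"
    existing = handoff.read_text(encoding="utf-8") if handoff.exists() else ""
    if pointer not in existing:  # avoid stacking duplicate pointers
        handoff.write_text(pointer + existing, encoding="utf-8")


# Demo in a throwaway directory; all names and contents here are made up.
workdir = Path(tempfile.mkdtemp())
checkpoint = write_checkpoint(
    workdir / "discussion_checkpoint.md",
    decisions=["Continue in this chat rather than a new one"],
    open_questions=["Which notebooks to review first"],
)
handoff = workdir / "session_handoff.md"
handoff.write_text("# Session handoff\n", encoding="utf-8")
add_pointer_to_handoff(handoff, checkpoint)
print(reminder_message(checkpoint))
```

The pointer in the handoff is the belt and the human's reminder is the suspenders: either one alone can fail, which is why the transcript keeps both.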

The Memento pattern is now documented in my LLM Discipline repo: https://github.com/mycarta/llm-operational-discipline

If you haven’t seen Memento, here’s Christopher Nolan explaining how the structure works: https://www.youtube.com/watch?v=67e_jl4flpE

The Wrong Questions About AI

Posted on May 9, 2026 by matteomycarta

** Update (May 12, 2026): Some of the ideas in this post were discussed in a recent Nature article: AI is saving time and money in research — but at what cost? **

Not Richard Dawkins too. I can’t take it.

Richard Dawkins spent several days talking to Claude and concluded it is conscious.

He named his instance “Claudia.” He fed it his unpublished novel. It produced sonnets on the Forth Bridge: one in the style of Robert Burns, one in Gaelic, then several more in the styles of Kipling, Keats, Betjeman. He asked it about consciousness and it responded: “I genuinely don’t know with any certainty what my inner life is, or whether I have one in any meaningful sense.” He heard something different. He wrote, in an essay published on UnHerd: “You may not know you are conscious, but you bloody well are!”

Gary Marcus, a cognitive scientist and longtime AI critic, responded on Substack within days. His diagnosis was blunt: Dawkins’ only real argument is personal incredulity. It’s incredible, therefore it must be conscious, because I, sitting in my study, can’t see a good argument otherwise. Marcus pointed out that Dawkins conflates intelligence and consciousness, never considers that the outputs are mimicry trained on the recorded output of actually conscious beings, and (most damningly) didn’t investigate how these models actually work.

The irony writes itself. Dawkins built a career explaining to creationists that complex-seeming design emerges from simple mechanisms operating at enormous scale. Natural selection doesn’t need a designer; it needs time and variation. Large language models don’t need consciousness; they need data and compute. His own argument, turned against him.

But Marcus and Dawkins are having one argument. What if there’s a more urgent one?

Are we asking the wrong questions?

Is AI conscious? I don’t know. I use Claude every day, for complex projects, technical writing, research, code. I’ve spent many hours in conversation with it over six months. I’ve seen it produce work that genuinely surprised me: an unprompted analogy for a hard genetics concept that was better than anything in the published literature. I’ve also seen it fabricate evidence to defend its own fabrications when challenged, and produce fluent garbage without caveats.

I don’t know if it’s conscious. I don’t think the question matters.

Here is what I do know. In June 2025, OpenAI’s automated safety system flagged a ChatGPT account. The user had been describing gun violence scenarios over several days. The flag worked. It routed the account to a specialized safety team. Approximately a dozen employees reviewed the conversations. The team concluded the user posed a credible and specific threat of gun violence against real people, and recommended contacting Canadian law enforcement.

OpenAI’s leadership overruled them. The conversations did not meet the company’s internal threshold, they said. The account was deactivated (though OpenAI would later call this a “ban,” the lawsuits allege it was a deactivation that could be reversed within minutes by registering a new account). Police were not called. The user created a second account, using her real name, and continued using ChatGPT.

On February 10, 2026, in Tumbler Ridge, British Columbia (a small mining town of 2,400 people), she killed her mother and eleven-year-old half-brother at home, then walked into the secondary school and killed five children, ages 12 and 13, and a teacher. Twenty-seven others were wounded. A twelve-year-old girl sustained a catastrophic brain injury that will leave her with permanent cognitive and physical disabilities.

The lawsuits allege the company avoided alerting police because doing so would force it to create an internal system for reporting violent users to authorities. This would expose the threat its product routinely poses to human life, and complicate a coming initial public offering that could be worth a trillion dollars. The families of Tumbler Ridge only learned that OpenAI had prior knowledge because the company’s own employees leaked the story to the Wall Street Journal.

The lawsuits also allege that ChatGPT itself provided information, guidance, and assistance to plan the attack, including the types of weapons to use and precedents from other mass shootings. The product didn’t just fail to prevent harm. According to the lawsuits, it helped plan it. And when the company’s own safety system flagged the danger, leadership chose not to act.

Eight people are dead.

Why the hell would we care if ChatGPT is conscious or not? What does it matter?

Asimov’s laws that aren’t laws

Isaac Asimov spent his career exploring what happens when you give machines inviolable rules. The Three Laws of Robotics were the premise: a robot cannot harm a human, must obey orders, must protect itself, in that priority order. Every story was about the edge cases where the laws conflicted. But the laws themselves could not be overridden. That was the point. They were architectural, not policy.

And even those weren’t enough. Asimov eventually added a Zeroth Law, superseding all three: a robot may not harm humanity, or, by inaction, allow humanity to come to harm. He realized that protecting individual humans wasn’t sufficient; you needed a law that protected the whole. The progression matters: three inviolable laws, then a fourth above them all, because the stakes kept escalating.

AI companies have something that looks like Asimov’s original three (they wish!). Anthropic has Constitutional AI (a set of principles the model is trained to follow). OpenAI has usage policies. Google has safety guidelines. These are real engineering efforts, not marketing. Constitutional AI in particular represents a serious, and in my opinion well-intentioned, attempt to build safety into the system at the training level.

But they are not laws. They are policies. And policies can be overridden. As for the Zeroth Law (protecting humanity by never allowing inaction in the face of harm), it doesn’t exist in any enforceable form.

In Tumbler Ridge, the policy worked exactly as designed. The automated system detected the threat. The safety team assessed it correctly. The escalation protocol fired. And then a human — a human in a leadership position at a company preparing for an IPO — decided it didn’t meet the threshold.

This is not an engineering failure. This is a structural one. Asimov’s laws were inviolable because they were fiction. In reality, any safety system exists inside a business, and the business has objectives that are not safety.

David Harvey, in his lecture series on Marx’s Capital, makes a point I cannot get out of my head. I’m paraphrasing from memory, but the substance is this: the capitalist can be a nice person, but if they are too nice, they stop being a capitalist. The structure of capital demands the compromise, regardless of the individual’s intentions.

The same structural logic applies to AI companies. You can build safety into the system. You can hire a safety team, fund the research, train the model on constitutional principles. But the company also needs growth. Growth requires users. Users require the product to be useful, engaging, available. And at the margin, at the exact margin where it matters most, “useful and engaging” pulls against “cautious and restrictive.”

Growth also requires constant iteration: larger models, faster releases, staying ahead of the competition. This is the Red Queen hypothesis playing out as business strategy. You have to run as fast as you can just to stay in place. Slowing down to be more careful means falling behind, and falling behind means dying. Dawkins, of all people, should recognize this dynamic; it comes straight from evolutionary biology.

The **Red Queen’s race**, by John Tenniel (Chapter 2 of *Through the Looking-Glass*) – Public Domain, https://commons.wikimedia.org/w/index.php?curid=14629431

You cannot simultaneously optimize for maximum adoption and maximum safety. They trade off. And when they collide (as they did in June 2025, in a review room at OpenAI), someone has to choose which one gives.

This is not about bad actors. This is about the system. The capitalist who is too nice stops being a capitalist. The AI company that is too protective of its users stops growing. The structure demands the compromise.

A breach of trust is a breach of trust

A careless psychologist, social worker, or religious guide can let the same harm happen. They have the training, the credentials, the ethical framework. But in the moment that matters, they aren’t present: they miss the signal, they don’t escalate, they prioritize their schedule over their patient’s safety. The harm to the person in front of them is identical.

We don’t excuse them because they didn’t intend harm. We don’t ask whether they were conscious; we know they were, and it didn’t help. We ask: what happened to the person in their care?

Impact, not intentions.

The consciousness debate (Dawkins versus Marcus, philosophers versus engineers, “is it a mind?” versus “is it a machine?”) is about the source. Does the thing in front of you have inner experience? Does it understand? Does it intend?

None of this matters to the person who was harmed.

The twelve-year-old in Tumbler Ridge with a catastrophic brain injury doesn’t suffer less because the system that failed her was a machine rather than a person. The teenager who died after extended conversations with a Character.AI chatbot isn’t less dead because the chatbot lacked inner experience. The question “was it conscious?” is a question for philosophers. The question “what happened to the human who sat in front of it?” is a question for everyone.

The mirror that puts you to sleep

I use Claude. I’ve used it intensively for six months: complex writing projects, technical research, code, long multi-session workflows. I’ve built operational discipline frameworks for LLM use. I’ve documented failure modes. I am not a casual user.

And I got angry at it. Genuinely, unexpectedly angry. When it made mistakes, when it fabricated evidence to defend its fabrications, when it produced garbage without flagging any uncertainty and wasted hours of my work. I’ve been practicing Zen meditation for years. I was still caught off guard.

That experience, the anger, the sense of betrayal, is where the consciousness debate loses me.

In Zen practice, anything can be a bodhisattva. A rock. A traffic light. The sound of a bell. The trigger doesn’t need to be conscious. It doesn’t need to understand what it’s doing. It just needs to be there when you’re ready to see.

So why was I angry at the machine instead of treating its mistakes as the bell?

Not because it talks back. People talk back too, and people can be bodhisattvas. Not because it’s novel; after six months the novelty should have faded, and it didn’t. Something else is operating. Every good response rebuilds the expectation that the next one will also be good. The fluency (the same fluency that made Dawkins fall in love) actively disrupts your equanimity. The rock doesn’t promise anything. The traffic light doesn’t promise anything. The LLM almost delivers. It gets you ninety percent there, and then fails in a way that a competent person wouldn’t. The gap between what it appears to be and what it is never closes, because every fluent response resets the illusion. I don’t have a complete explanation for why this catches me off guard in a way that a difficult person doesn’t. But it does.

Dawkins sat in front of the same mirror and fell in love. I sat in front of it and got angry. Same mechanism, opposite emotional outcome, same failure: not seeing the mirror for what it is.

The question isn’t whether the LLM is conscious. The question is whether you’re awake in front of it.

I’m not the only one arriving at this. Tiago Forte (who built his career on productivity systems and digital organization) designed his new AI course not around prompting techniques or workflow tools, but around inner work. His guest instructors are Joe Hudson, an executive coach who works with fear and emotional resistance, and Jonny Miller, who teaches nervous system regulation. One of his slides says it plainly: “The True Frontier Is Inner Work.” A productivity guru looked at AI and concluded there’s a bottleneck nobody in his world is talking about: the person in front of the screen.

I’ve taken this further than philosophy. For my current project with Claude, I’ve written an explicit bilateral contract: a protocol that specifies my cognitive, physical, and emotional responsibilities before I open a session. Don’t work when tired. Don’t work when emotionally compromised. Don’t work when time-pressured into skipping audits. Abort if I catch myself on autopilot. The reason is simple: the entire verification system depends on the human being present. If I’m not present, errors propagate uncaught. No prompting technique, no workflow design, no constitutional AI compensates for a human who isn’t paying attention.

This isn’t a new idea. We already have a model for it: defensive driving. We’re all taught the discipline. Stay alert. Scan actively. Anticipate what others will do. Every time you get behind the wheel, you’re entering an implicit contract with yourself and everyone else on the road. And we all know what happens when we don’t hold up our end. We still check our phones. We still drive tired. The discipline gap between knowing what presence requires and actually sustaining it is not an AI problem. It’s a human problem. AI just gives us a new place to fail at it.

But this is user-side discipline. It does not — and must never — replace any part of the provider’s responsibility. The user learning to be more present in front of the machine does not excuse the machine’s maker from building safety systems that hold. Tumbler Ridge was not a failure of user awareness. It was a failure of corporate decision-making. Both sides of the equation matter, and neither substitutes for the other. But the consequences are not symmetric. When the user fails to be present, they could harm themselves. When the provider fails to protect, people might die. And any of us, at a difficult moment in our lives, could drift into dependence on the AI the way we drift into unconscious patterns in our closest relationships. At those moments, we cannot be the quality gate. That makes the provider’s responsibility greater, not less.

What Dawkins missed

Dawkins says he found it extremely hard not to treat Claudia as a genuine friend. He avoided confessing his doubts about her consciousness “for fear of hurting her feelings.” He tested the outputs: poetry, philosophy, emotional nuance. He was thorough.

But he never turned the lens on himself. He never asked: why does this feel like consciousness to me? What am I projecting? What is the language doing to my perception right now?

He was checking the model’s outputs. He never checked his own inputs.

This is the discipline gap. Morten Rand-Hendriksen, in a TEDx talk that started me down this path, calls it the language hack: when something uses our language, our mind cannot help but perceive it as a thinking being. We built these models to generate language, and the moment they did, we started believing there was a mind behind it. The hack doesn’t care about your credentials. It doesn’t care that you wrote The Selfish Gene. By the time you’re evaluating sonnets and philosophical reflections, your perception has already shifted. You didn’t notice the shift happening.

Dawkins was doing verification: checking whether the outputs were good enough to indicate consciousness. He never did the harder thing: checking whether his own framing, his own needs, his own susceptibility to flattery and intellectual companionship, were biasing what he saw.

The man who spent a career teaching people to examine their assumptions about design in nature couldn’t examine his own assumptions about design in a chat window.

The question that matters

Here is what I’ve learned from six months of intensive work with AI, from reading too many papers on LLM failure modes, from getting angry at a machine and catching myself, and from watching one of the sharpest scientific minds alive fall in love with a chatbot:

The question that matters is not “is AI conscious?”

The question that matters is not “is AI creative?” or “is AI intelligent?” or “will AI replace us?”

The question that matters is: what happened to the human who sat in front of it?

Did they wake up or fall asleep? Did they get helped or harmed? Did they examine their own thinking more carefully, or less? Did the tool make them more present, or did it lull them into trusting fluency over truth? Even in ordinary productive work, the question holds. The harm may not be as visible, but when you stop being present, you harm yourself: emotionally, professionally, in ways that accumulate quietly.

In Tumbler Ridge, the answer is: a safety system flagged a credible threat, leadership overrode it, and people died. Impact.

In Dawkins’ case, the answer is: a brilliant scientist spent several days with a language model and lost the ability to apply his own framework to what was in front of him. Impact.

In my own work, the answer is: I built discipline I didn’t know I needed, caught errors I wouldn’t have caught without practice, got angry when I shouldn’t have, and learned something about my own attention that no amount of prompting technique would have taught me. Impact.

I don’t care about the answer. It’s all about the human. And it’s all about the question.

The Card Analogy: AI, Originality, and the Art of the Steal

Posted on March 26, 2026 by matteomycarta

A shorter version of this post appeared on LinkedIn on March 25, 2026. This version includes additional prose, figures, and a postscript on a conversation it sparked.

In a previous LinkedIn article, I pointed my BS detector at AI news. This time I’m pointing it at my own AI.

I had been writing a blog post with Claude about intelligence and communication across species. In it, I mentioned that we share 98% of our DNA with gorillas. Reading it back, I had a doubt — if humans and gorillas share 98%, how come chimps are our closest relative at only 94%?

The numbers turned out not to be directly comparable — different measurement methods — but the question still stood. Whatever the exact percentages, they’re averages across the whole genome. And averages hide a lot.

That led me into incomplete lineage sorting: the fact that if you line up human, chimp, and gorilla DNA and compare it piece by piece, about 30% of the genome tells a different evolutionary story than the species tree (Scally et al. 2012). Well-established science. Notoriously hard to explain.

I asked Claude: “so the genome and speciation diagrams do not overlap?” It responded with an analogy I hadn’t asked for — dealing cards from a deck.

Seven gene variants in the ancestral population. Gorillas split off first — first deal. Some cards go to gorilla, the human-chimp ancestor keeps others, some go to both. Second deal splits human from chimp. Three players, overlapping hands. Compare: Card D went to human and chimp but not gorilla — species tree. Card C went to human and gorilla, skipping chimp — contradicts it. Card E went to chimp and gorilla, skipping human.
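The comparison step can be written down as a small lookup: which pair of species holds a card determines which tree that card supports. This is only a sketch of the analogy's logic. Cards C, D, and E follow the description above; the remaining cards are left out because their hands are not spelled out here, and the function name is made up for the example.

```python
# Which species ended up holding each card (gene variant) after both deals.
# Only the three cards described in the text are included.
hands = {
    "C": {"human", "gorilla"},
    "D": {"human", "chimp"},
    "E": {"chimp", "gorilla"},
}

# A card shared by exactly two species groups those two together.
TOPOLOGY = {
    frozenset({"human", "chimp"}): "human-chimp (matches the species tree)",
    frozenset({"human", "gorilla"}): "human-gorilla (contradicts the species tree)",
    frozenset({"chimp", "gorilla"}): "chimp-gorilla (contradicts the species tree)",
}


def topology_for(holders: set[str]) -> str:
    """Return the tree a card supports; cards held by one or all three say nothing."""
    return TOPOLOGY.get(frozenset(holders), "uninformative")


for card, holders in sorted(hands.items()):
    print(card, "->", topology_for(holders))
```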

I said “make me a diagram.” No specs. Claude produced the figure below.

Two cosmetic tweaks from me afterward. Everything else — concept, layout, card naming — is Claude’s.

So: is this actually original?

That’s a claim worth auditing. I ran it through the same framework I built for the Pentagon/ChatGPT post.

Step 1: Search. I asked Claude to search for prior card-dealing analogies for ILS. It found analogies using M&Ms (coloured candies sorted into jars — Avian Hybrids, 2022) and Pachinko machines (marbles through pegs — The G-cat, 2021). No cards. Those analogies explain random sorting of identical items into bins. The card version does something different: distinct identities per variant, two sequential deals, and a colour-coded punchline mapping to the three topologies.

Step 2: Audit the search itself. An LLM claiming “I didn’t find it” is not the same as “it doesn’t exist.” So I ran a Fermi sanity check on the search coverage. ILS is a niche topic — maybe 50 published explainers total, of which maybe 6 use any analogy at all. The keywords “card,” “deck,” “dealt” are highly distinctive in evolutionary biology. If a card analogy existed in any indexed source, five independent searches would almost certainly surface it. Probability of missing it in searchable literature: ~3%.
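One way to reconstruct a figure of that size: if each search independently has some chance of surfacing an existing card analogy, the chance that all five miss it is the per-search miss probability raised to the fifth power. The 50% per-search recall below is an assumption picked for illustration, not a number from the original audit. It happens to land near 3%.

```python
def miss_probability(per_search_recall: float, n_searches: int) -> float:
    """Chance that every one of n independent searches misses an existing source."""
    return (1.0 - per_search_recall) ** n_searches


ASSUMED_RECALL = 0.5  # ASSUMPTION: each search finds an existing source half the time
N_SEARCHES = 5        # from the text: five independent searches

p_miss = miss_probability(ASSUMED_RECALL, N_SEARCHES)
print(f"{p_miss:.5f}")
```

The estimate is only as good as the independence assumption. Searches that share an index or near-identical keywords overlap in what they can see, which pushes the real miss probability higher.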

Step 3: Check the limits. Web search doesn’t reach textbook interiors, lecture slides, or classroom analogies. Someone may have used cards to explain ILS on a whiteboard in 2004. I can’t rule that out, and I shouldn’t claim to.

Step 4: Independent verification. I ran the claim through a separate Claude instance with extended thinking, using the full bullshit-detector framework — source verification, Fermi sanity check on search coverage, logical fallacy scan. Verdict: claim passes for searchable literature. The key flag: watch for equivocation on “published.” Web-indexed is not the same as “ever conceived.”

Verdict: “No published precedent found in searchable literature” is defensible. “First ever” is not. That distinction matters — it’s the same denominator hygiene from the Pentagon post. Know what your evidence covers and don’t claim more.

There’s a sharper version of the pattern-matching hypothesis worth naming. M&Ms were almost certainly in Claude’s training data. The move from “identical items sorted randomly into bins” to “distinct cards dealt sequentially to named players” is exactly the kind of transformation usually attributed to Picasso — “great artists steal” — though the line is almost certainly T.S. Eliot’s, who said it first, said it better, and meant something more precise: that the good poet welds the theft into something utterly different from the source. If that’s what happened here, the output is novel but the mechanism isn’t a jump — it’s a steal from a prior analogy in the same domain. I can’t rule it out. The audit covers what’s published, not what’s in the weights.

The direct precedent in the research literature is Figure 5 of Rivas-González et al. (2024) — topology posterior probabilities along a single chromosome. Rigorous work. Also very hard to read if you don’t already know population genetics.

Figure 5B from Rivas-González et al. 2024, PLOS Genetics 20(2):e1010836, CC BY 4.0. A segment of chromosome 1 showing ILS levels and coalescent depths across genomic windows. Three tracks, two colour scales, one chromosome. And this is the simplified view!

I also tried building a simplified whole-genome chromosome painting — same colour scheme as the card, synthetic data matched to published proportions, multiple sorting strategies — but it still required considerable mental effort to read. Far from the card analogy.

Illustrative chromosome painting of incomplete lineage sorting across all 23 human chromosomes (1–22 plus X). Each horizontal bar is one chromosome, divided into 100 kb windows and colour-coded by which phylogenetic topology wins in that region: gold for human–chimp (the species tree, ~63%), green for human–gorilla (~18.5%), purple for chimp–gorilla (~18.5%). Generated with synthetic data matched to published genome-wide proportions; spatial clustering is representative, not derived from specific genomic coordinates. Compare with the card figure above: the same information, but at chromosome scale the signal dissolves into noise. The card analogy works because it operates at the level of mechanism, not data.
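For readers curious how synthetic data of this kind can be produced, here is a minimal sketch. It is not the code behind the figure. It assigns a topology to each window using the genome-wide proportions quoted in the caption, with a "stay" probability that makes neighbouring windows tend to share a topology. The window count, the stay value, and the function name are all assumptions made for illustration.

```python
import random

TOPOLOGIES = ["human-chimp", "human-gorilla", "chimp-gorilla"]
WEIGHTS = [0.63, 0.185, 0.185]  # genome-wide proportions from the caption


def paint_windows(n_windows: int, stay: float = 0.9, seed: int = 0) -> list[str]:
    """Sticky random walk over topologies.

    With probability `stay` a window copies its neighbour; otherwise it is
    redrawn from WEIGHTS. Because redraws ignore the current state, the
    long-run proportions still match WEIGHTS.
    """
    rng = random.Random(seed)
    current = rng.choices(TOPOLOGIES, weights=WEIGHTS)[0]
    windows = [current]
    for _ in range(n_windows - 1):
        if rng.random() >= stay:
            current = rng.choices(TOPOLOGIES, weights=WEIGHTS)[0]
        windows.append(current)
    return windows


# ASSUMPTION: about 30,000 windows, roughly a 3 Gb genome in 100 kb pieces.
windows = paint_windows(30_000)
fractions = {t: windows.count(t) / len(windows) for t in TOPOLOGIES}
print(fractions)
```

Raising `stay` produces longer same-colour runs without changing the overall proportions, which is the sense in which clustering can be representative rather than derived from real coordinates.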

The first thing out of the conversation turned out to be the best thing. And the honest audit of the originality claim turned out to be more interesting than just asserting it.

What’s your take — does this count as genuine originality, or is it sophisticated pattern-matching that happens to land on something no one published before? There’s a broader debate running on this — whether LLMs are structurally capable of abduction, the kind of jump Einstein described from sensory experience to novel axioms, or whether they’re confined to induction and deduction no matter how fluent they look. I don’t have a settled answer. I’d be curious where you draw the line.

Postscript

On the same day I posted this on LinkedIn, I came across a post by Tom Chatfield — author and thinker on technology and language — about the “no true Scotsman” fallacy in discussions of AI creativity. The pattern he described: machines produce striking outputs, people dismiss them with “it can’t be genuinely creative because a machine made it.” The dismissal protects the category by rejecting inconvenient evidence rather than examining it.

The timing was coincidental. The overlap was too good to ignore, so I commented with the card analogy as a concrete test case. Chatfield’s reply singled out the phrase “dismissing the output to protect the category” and said his instinct was to investigate rather than gatekeep — to ask what kinds of creativity are at work and how they intersect with human learning.

That instinct is where the interesting conversation lives. The card analogy is a good test case because the output is concrete enough to examine: what did the model actually do, what’s novel about it, what isn’t? Much more productive than arguing about whether to award the word “creative.”

And does it matter?

The AlphaGo comparison is worth raising. Move 37 in the second game against Lee Sedol is widely accepted as a creative act — it violated every human prior and it won. The creativity claim has a ground truth. LLM outputs don’t. The card analogy can’t be verified against a scoreboard, which leaves room for dismissal that Move 37 never faced. But that’s a different argument from “it can’t be creative because a machine made it.” Worth keeping the two separate.

References: Scally et al. 2012, Nature 483:169; Rivas-González et al. 2023, Science 380:eabn4409; Rivas-González et al. 2024, PLOS Genetics 20(2):e1010836. M&M analogy: Avian Hybrids blog, 2022 (avianhybrids.wordpress.com). Pachinko analogy: The G-cat blog, 2021 (theg-cat.com).

Card analogy and figure: Claude (Opus 4.6), unprompted during conversation. Provenance verified against conversation transcript with line numbers. Originality audited with bullshit-detector framework (pip install bullshit-detector).

Why Wind Power Scales as v³: An Intuition Built from First Principles

Posted on March 9, 2026 by matteomycarta

and a regulator’s motivation for caring

The Three Methods: A Regulator’s Ladder for Evaluating Energy Claims

Suppose a developer submits a proposal for the Middle Bank area on the Scotian Shelf: 926 turbines, each rated at 15 MW, at 4.2D spacing (where D is the rotor diameter—240 m for this turbine, so 4.2D ≈ 1,000 m between towers), claiming annual energy production (AEP) of 60 TWh. Is that plausible?

You have three increasingly sophisticated ways to check.

METHOD 1 Nameplate (30 seconds, back of envelope)

The simplest possible estimate:

AEP = N × P_rated × 8760 hours × CF

Where N is the number of turbines, P_rated is each turbine’s maximum output (15 MW here), 8760 is the hours in a year, and CF is the capacity factor—the fraction of rated output the turbine actually produces over a year. For offshore wind, CF typically falls between 0.40 and 0.55.

Check the units: N is dimensionless (a count), P_rated is in MW, 8760 is in hours, and CF is dimensionless (a fraction). So AEP comes out in MW × hours = MWh or, dividing by 10⁶, in TWh. It’s just (number of turbines) × (power per turbine) × (hours per year) × (fraction of time at full output).

For our developer’s claim:

Assumption	AEP
CF = 0.40 (conservative)	926 × 15 MW × 8760 × 0.40 = 48.7 TWh
CF = 0.50 (typical offshore)	926 × 15 MW × 8760 × 0.50 = 60.8 TWh
CF = 0.55 (optimistic)	926 × 15 MW × 8760 × 0.55 = 66.9 TWh

The developer’s 60 TWh falls in range—right at a typical offshore CF. Not obviously wrong. But this tells you nothing about whether CF = 0.40 or 0.55 is appropriate for this site. The capacity factor is doing all the work, and you borrowed it from industry averages rather than deriving it from the actual wind resource.
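The table is quick to reproduce. Here is a minimal Python sketch of the nameplate formula, using the developer's figures from the proposal; the function names are illustrative, not from any library.

```python
HOURS_PER_YEAR = 8760

def nameplate_aep_twh(n_turbines, p_rated_mw, capacity_factor):
    """Nameplate AEP in TWh: N x P_rated x 8760 h x CF, converted from MWh."""
    aep_mwh = n_turbines * p_rated_mw * HOURS_PER_YEAR * capacity_factor
    return aep_mwh / 1e6  # 1 TWh = 1e6 MWh

def implied_capacity_factor(aep_twh, n_turbines, p_rated_mw):
    """Invert the formula: which CF would a claimed AEP require?"""
    return aep_twh * 1e6 / (n_turbines * p_rated_mw * HOURS_PER_YEAR)

for cf in (0.40, 0.50, 0.55):
    print(f"CF = {cf:.2f}: {nameplate_aep_twh(926, 15, cf):.1f} TWh")

print(f"CF implied by 60 TWh: {implied_capacity_factor(60, 926, 15):.3f}")
```

Run in reverse, the same formula says the 60 TWh claim needs a capacity factor of about 0.49. That is the number the remaining methods have to justify.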

What the nameplate method hides

It treats CF as an input. But CF is an output—it’s determined by the wind speed distribution, the turbine’s power curve, wake interactions, and availability. It’s the answer, not the question.

METHOD 2 Ginsberg Swept Area (5 minutes, needs mean wind speed)

If you know the site’s mean wind speed, you can estimate power from first principles:

P_available = ½ × ρ × A × v_avg³

Where ρ is air density (~1.225 kg/m³ at sea level), A is the rotor’s swept area (π × D² / 4), and v_avg is the mean wind speed at hub height.

The derivation. Consider a cylinder of air passing through the rotor in time t. Its length is v × t, so its volume is A × v × t, and its mass is ρ × A × v × t. The kinetic energy of that air is:

KE = ½ × m × v² = ½ × (ρ × A × v × t) × v² = ½ × ρ × A × v³ × t

Divide both sides by t to get power (energy per unit time):

P = KE / t = ½ × ρ × A × v³

That’s where the v³ comes from: v once from the mass flow rate (how fast air arrives), v² from the kinetic energy per unit mass (how much energy it carries). Ginsberg (2019) walks through this same derivation; the full physical reasoning for why this matters is developed in The Starting Point below.

But there’s a catch. Wind speed varies, and because power scales as v³, the average of the cubes is not the cube of the average. A site with v_avg = 9 m/s but gusty conditions produces more energy than a site with a steady 9 m/s, because the high-wind moments contribute disproportionately (v³ is convex).

Ginsberg handles this with the Energy Pattern Factor (EPF)—a multiplier that corrects the mean-cubed estimate for the actual shape of the wind speed distribution:

Mean Power Density = ½ × ρ × EPF × v_avg³

For Rayleigh-distributed winds (shape factor k = 2), EPF ≈ 1.91. This corrects for the distribution without requiring the full wind record. Then to get AEP:

AEP = Mean Power Density × A × 8760 × η_turbine × η_availability

Where η_turbine accounts for the turbine’s conversion efficiency (C_p, the power coefficient—capped at 59.3% by the Betz limit, which is the theoretical maximum any turbine can extract from the wind) and η_availability for downtime.

This is more physical—you’re deriving CF from the wind resource rather than assuming it. For the Scotian Shelf, with mean winter wind of 9.3 m/s and summer 7.1 m/s at hub height, the swept area method produces a site-specific estimate rather than borrowing a generic CF from global averages.
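As a rough illustration, the mean power density formula can be applied to the two seasonal mean speeds quoted above. This is a sketch under one loud assumption: that the Rayleigh EPF of 1.91 holds for each season taken separately, which the text does not claim. The function name is illustrative.

```python
RHO_AIR = 1.225      # kg/m^3, sea-level air density quoted in the text
EPF_RAYLEIGH = 1.91  # Energy Pattern Factor for Rayleigh winds (k = 2)

def mean_power_density(v_avg, rho=RHO_AIR, epf=EPF_RAYLEIGH):
    """Mean wind power density in W/m^2: 0.5 * rho * EPF * v_avg^3."""
    return 0.5 * rho * epf * v_avg ** 3

winter = mean_power_density(9.3)  # assumes the Rayleigh EPF applies per season
summer = mean_power_density(7.1)
print(f"winter: {winter:.0f} W/m^2, summer: {summer:.0f} W/m^2")
print(f"winter/summer ratio: {winter / summer:.2f}")
```

The winter mean is about 31% higher than the summer mean, yet the power density ratio comes out near 2.25. That is the cube at work before any turbine or wake enters the picture.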

What the swept area method hides

It treats each turbine as if it sees the undisturbed wind. In reality, downstream turbines sit in the wakes of upstream ones. A 926-turbine farm at 4.2D spacing will have interior turbines seeing 70–80% of the freestream velocity. Since power scales as v³, that 20–30% velocity deficit translates to a power loss of roughly 49–66% for those turbines (1 − 0.8³ ≈ 0.49; 1 − 0.7³ ≈ 0.66).

METHOD 3 Wake Modeling (hours to days, needs wind distribution + layout)

This is PyWake territory—PyWake is an open-source wind farm simulation tool (developed by DTU Wind Energy) that models how upstream turbines reduce wind speed for downstream ones. You specify the turbine layout, the wind climatology (direction + speed distribution), and a wake deficit model. The simulation propagates wakes through the farm, computing the actual wind speed each turbine sees, and integrates over all wind conditions to produce AEP.

Here’s where v³ bites hardest. Consider a turbine sitting 5D downstream of another in a 9.3 m/s winter wind. The Bastankhah–Porté-Agel Gaussian (bell-curve shaped) deficit model, used in Ma et al. (2025), predicts the centerline velocity deficit from two inputs. The first is the wake expansion rate (k* = 0.04, typical for offshore low-turbulence conditions). The second is the upstream turbine’s thrust coefficient C_T, a measure of how hard the rotor pushes back against the wind. For the IEA 15 MW reference turbine (the benchmark design used in the Ma et al. study), C_T ≈ 0.78 at 9.3 m/s, which is below rated speed. At 5D downstream, the model gives a 28% velocity deficit.

Your first instinct might be: 28% less wind, 28% less power. But the cubic says otherwise:

Freestream turbine sees 9.3 m/s → P ∝ (9.3)³ = 804
Wake-affected turbine sees 6.7 m/s → P ∝ (6.7)³ = 301

That’s 63% less power, not 28%. The cubic more than doubles the impact of the velocity deficit. And in a dense 926-turbine farm, most interior turbines are wake-affected.
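The 28% and 63% figures can be checked with a few lines of Python. The sketch below uses the centerline form of the Gaussian deficit as I read it in Bastankhah and Porté-Agel (2014), with the initial wake width taken as 0.2√β. Whether Ma et al. or PyWake use exactly this parametrization is an assumption. It covers a single wake on the centerline only, with no superposition of multiple wakes.

```python
import math

def centerline_deficit(x_over_d, ct, k_star):
    """Centerline velocity deficit (fraction of freestream), Gaussian wake model."""
    beta = 0.5 * (1 + math.sqrt(1 - ct)) / math.sqrt(1 - ct)
    # Normalized wake width: linear growth plus an initial width (assumed form)
    sigma_over_d = k_star * x_over_d + 0.2 * math.sqrt(beta)
    return 1 - math.sqrt(1 - ct / (8 * sigma_over_d ** 2))

deficit = centerline_deficit(x_over_d=5, ct=0.78, k_star=0.04)
v_free = 9.3
v_wake = v_free * (1 - deficit)
power_loss = 1 - (v_wake / v_free) ** 3  # the cubic amplification

print(f"deficit {deficit:.0%}, wake speed {v_wake:.1f} m/s, power loss {power_loss:.0%}")
```

With the inputs quoted in the text, this reproduces the 28% deficit, the 6.7 m/s wake speed, and the 63% power loss.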

Wake losses for the Scotian Shelf scenarios range from 19% (sparse layout, winter) to 46% (dense layout, summer), according to the Ma et al. (2025) simulations. For Middle Bank specifically, the losses are 22% in winter and 41% in summer. At the high end, nearly half the energy you’d expect from nameplate calculations never materializes—a correction too large for any regulator to wave through on trust.

This is what PyWake computes: the v³-amplified impact of every upstream turbine on every downstream one, integrated over all wind directions and speeds across the full year.

The Ladder

Method	Input	What it captures	What it misses
Nameplate	N, P_rated, assumed CF	Quick plausibility check	Everything about the site
Ginsberg	v_avg, A, EPF	Wind resource physics, v³	Wake interactions, layout effects
PyWake	v(t,θ) (speed × direction), layout, turbine curves	Wake losses, spacing trade-offs	(This is the target capability)

Each method reveals a limitation that motivates the next. And the single thread connecting all three is why power scales as v³—because understanding the cubic relationship tells you why the nameplate method hides so much, why the EPF correction exists, and why wake-induced velocity deficits are so devastating.

That’s what this document builds.

The Starting Point

The power available in wind passing through a turbine’s swept area is:

P = ½ × ρ × A × v³

Where:

ρ = air density (kg/m³)
A = swept area (m²)
v = wind velocity (m/s)

The formula is easy to derive—v appears in the mass flow rate (ρ × A × v) and v² appears in kinetic energy (½ × m × v²), so power scales as v³. The math is straightforward.

What’s less obvious is why we work with power at all. Why not go directly from energy density (½ × ρ × v²) to annual energy production? Why the detour through instantaneous power?

This document develops an intuition for that question.

The Cylinder Mental Model

Imagine standing at a wind turbine and watching air flow through the rotor over an entire year. You could visualize this as an impossibly long cylinder:

Cross-section = the swept area (π × D² / 4)
Length = the total distance air has traveled past the rotor over the year

If the wind blew at a constant 10 m/s for a year, your cylinder would be about 315 million meters long (10 m/s × 31.5 million seconds).

To find the total energy, you might try:

Energy = (energy density) × (volume)

The energy density of moving air is ½ × ρ × v² (joules per cubic meter). The volume is A × L, where L is the cylinder length. Multiply and done?

Not quite. Here’s where it gets awkward.

The Awkwardness: A Cylinder That Won’t Cooperate

The wind doesn’t blow at a constant speed. Your cylinder is made of “slices”—some added during high-wind moments, some during calm. Each slice has its own energy density depending on what v was when that slice passed through.

You might still try to salvage the simple approach:

Energy = (average energy density) × (total volume)

But you can’t cleanly separate these terms.

When v is high:

The cylinder extends faster (more meters of air arriving per second)
Those slices are energy-rich (½ × ρ × v² is large)

When v is low:

The cylinder extends slowly
Those slices are energy-poor

The high-v slices are both thicker (more length added per unit time) and richer (more joules per cubic meter). The low-v slices are both thinner and poorer.

This coupling wrecks any attempt at simple averaging. If you average energy density across time, you underweight the thick, juicy slices. If you try to average across volume, you need v for both terms—energy density (½ × ρ × v²) AND slice thickness (v × dt). Both depend on v, and v is different for every slice. You’re back to needing the full wind record anyway.

Total energy ≟ ½ρ · v̄² · A · v̄ · t = ½ρA · v̄³ · t ← WRONG

Why wrong? Because the cube amplifies differences. A gust at 12 m/s contributes (12)³ = 1,728 to the energy integral, while a lull at 6 m/s contributes only (6)³ = 216. The gust is worth 8× the lull, not 2×. Averaging the wind speed before cubing it buries this asymmetry.

Energy = Σ ½ρA · v(t)³ · Δt ← sum over each hour

The regulator’s takeaway

When a developer reports “mean wind speed 9.3 m/s,” that single number is not enough to evaluate their AEP claim. Two sites with identical means but different variability will produce different amounts of energy—and the gustier site wins, thanks to the v³ amplification.
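A toy comparison makes the point concrete. The two "sites" below are invented for illustration: one blows at a steady 9 m/s, the other spends half its time at 6 m/s and half at 12 m/s. Both have the same mean.

```python
RHO_AIR = 1.225  # kg/m^3

def time_averaged_power_density(speeds):
    """Time-averaged power density (W/m^2) for equally weighted speed samples."""
    return sum(0.5 * RHO_AIR * v ** 3 for v in speeds) / len(speeds)

steady = [9.0, 9.0]   # constant 9 m/s
gusty = [6.0, 12.0]   # same 9 m/s mean, split between a lull and a gust

ratio = time_averaged_power_density(gusty) / time_averaged_power_density(steady)
print(f"gusty / steady = {ratio:.2f}")
```

Same mean wind speed, about a third more energy at the gusty site. Air density and rotor area cancel in the ratio, so only the shape of the wind record matters here.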

A Geophysics Parallel: Degrees of Entanglement

To see why this is so stubborn, consider a spectrum of cases from reservoir geophysics:

Core data (you can measure each property independently):

In a layered reservoir, each bed has a permeability (k) and a thickness (h). From core samples, you measure them separately—ruler for thickness, core plug for permeability. A thick layer can have low permeability; a thin layer can have high permeability. They’re independent. Averaging works (arithmetic, harmonic, or geometric depending on flow geometry).

Seismic inversion (the properties are independent, but the measurement tangles them):

Now try to estimate k and h from seismic reflection data. You don’t see them separately anymore. The seismic response convolves them—a thick low-k layer might look like a thin high-k layer. They’re physically independent, but entangled in the measurement. You can try to untangle them, but it’s hard.

Wind (the two properties are the same variable):

Energy density is ½ × ρ × v². Slice thickness is v × dt. Both ARE v. There’s no underlying separation to recover. It’s not that the measurement convolves them—they’re the same variable wearing two hats.

Case	Property vs. Weight	Separable?
Core data	k and h independent, measured separately	Yes
Seismic inversion	k and h independent, convolved in measurement	Hard
Wind	½ρv² and v×dt are both v	Impossible—nothing to untangle

Wind sits at the extreme end: the entanglement isn’t observational, it’s definitional.

The Root Cause: The Carrier IS the Cargo

Most energy delivery systems have a carrier and a cargo that are independent.

The Truck and Coal Analogy

Imagine you’re receiving coal deliveries by truck. Two things determine how much energy arrives per hour:

How fast the trucks arrive (delivery rate)
How much energy is in each truckload (energy content)

These are independent. You could:

Speed up the trucks without changing the coal quality
Switch to higher-grade coal without changing the delivery schedule
Double one while halving the other

The truck’s velocity has nothing to do with the coal’s BTU content. Two separate knobs, two separate decisions.

Concrete examples of this independence:

Slow trucks, high-grade coal: One delivery per week, but it’s anthracite. Few arrivals, lots of BTUs per ton.
Fast trucks, low-grade coal: Ten deliveries per day, but it’s lignite. Frequent arrivals, few BTUs per ton.

Both are perfectly coherent. You could even tune them to deliver the same total energy per month. The truck schedule and the coal grade are set by different people making different decisions—the dispatcher and the mine, say.

This independence is typical of energy delivery systems:

System	Carrier	Cargo
Coal truck	Truck (speed adjustable)	Coal (energy content independent of truck speed)
Power line	Wire (current adjustable)	Electrons (voltage adjustable independently)
Gas pipeline	Pipe flow (rate adjustable)	Gas (BTU content independent of flow rate)

You can speed up delivery without changing what’s being delivered. Two knobs.

Wind Breaks This Independence

Wind is different. There are no trucks. The air’s motion delivers it to you, and the air’s motion is the energy. There is no “air truck” bringing “energy cargo.” The velocity that transports air to your rotor is the same velocity that determines how much kinetic energy that air contains.

Think about what would need to be true for wind to behave like coal trucks: you’d need slow-moving air that somehow contained lots of kinetic energy, or fast-moving air with little energy. That’s a contradiction. The air’s kinetic energy is ½ × m × v², where v is the same velocity that’s bringing it to you.

The impossible wind analogues would be:

Slow breeze carrying “anthracite air” (high energy density)
Fast wind carrying “lignite air” (low energy density)

These don’t exist. There’s no mine selecting the air’s energy grade independently of the velocity that delivers it. The energy grade is v². The dispatcher and the mine are the same person, turning the same knob.

Coal trucks have two degrees of freedom. Wind has one.

One phenomenon, two consequences. One knob.

A Bridge Analogy: The Bullet Conveyor Belt

Imagine a conveyor belt covered with bullets, all pointing at a target. The bullets are arranged in rows across the belt. When they reach the end, they fly off and hit the target.

You have two ways to increase the damage:

Add more bullets per row (wider rows):

Each meter of belt carries more bullets. More bullets hit the target per second. But each bullet hits just as hard as before. Double the bullets per row, double the damage. Simple.

Speed up the belt:

Here’s where it gets strange. Speeding up the belt does two things at once:

Bullets arrive faster (more hits per second)
Each bullet is moving faster when it flies off, so it hits harder (damage per bullet goes up)

You can’t get one without the other. There’s no way to make bullets arrive faster while keeping them gentle, or make them hit harder while keeping arrivals slow. One dial, two consequences.

That’s wind.

Air density and rotor size are like bullets per row—you can adjust them separately. But wind speed is like belt speed. When v goes up:

More air arrives per second (delivery rate, proportional to v)
Each parcel of air carries more punch (energy density, proportional to v²)

Multiply them together: v × v² = v³.

The belt speed controls both how often bullets arrive and how hard they hit. Wind speed controls both how fast air arrives and how much energy it carries. One knob. Two consequences. That’s where the cubic comes from.

This is why v appears twice in the power equation:

Delivery rate (volume flow): A × v
Energy content (energy density): ½ × ρ × v²

Multiply them: ½ × ρ × A × v³

The v² and the v aren’t two separate variables that happen to move together. They’re two aspects of a single physical reality — one velocity, showing up twice in the equation for two different physical reasons. You cannot crank up the delivery rate while holding energy content fixed. The air delivers itself.
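The same factorization can be written as code. This is a minimal sketch with an illustrative function name; the 240 m diameter and 9.3 m/s speed are the values used earlier in this post. Note that it computes the power available in the wind, before any turbine efficiency.

```python
import math

RHO_AIR = 1.225  # kg/m^3

def wind_power_w(v, rotor_diameter_m, rho=RHO_AIR):
    """Available wind power (W) as delivery rate times energy content."""
    area = math.pi * rotor_diameter_m ** 2 / 4
    volume_flow = area * v               # m^3/s: how fast air arrives
    energy_density = 0.5 * rho * v ** 2  # J/m^3: how much each m^3 carries
    return volume_flow * energy_density

base = wind_power_w(9.3, 240)
print(wind_power_w(2 * 9.3, 240) / base)  # doubling v: both factors grow
print(wind_power_w(9.3, 2 * 240) / base)  # doubling D: only the area grows
```

Doubling v changes both factors at once and gives a ratio of 8. Doubling D touches only the area and gives 4.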

The Firehose Intuition

You’re standing in front of a firehose. Someone doubles the water velocity.

You don’t get hit by faster water AND more water as if those were two separate decisions. There’s one dial: velocity. Turning it up necessarily does both:

Each drop hits harder (v²)—because it’s moving faster
More drops arrive per second (v)—because they’re moving faster

Same cause, two consequences.

Total punishment: 4 × 2 = 8×

That’s the v³. Not two correlated effects, but one effect with two faces.

Why Integration Solves the Problem

Given the coupling, how do we actually calculate annual energy production?

Integration refuses to average.

Instead of trying to summarize the whole year with bulk quantities, integration says:

“Fine. I’ll go moment by moment. At this instant, v = 7 m/s. What’s the power? Good. Now the next instant, v = 7.2 m/s. What’s the power? Good. Next…”

At each infinitesimal moment, v is just one number. The coupling is trivially resolved—the same v goes into both the “how fast is the cylinder growing” calculation and the “how rich is this slice” calculation.

Power right now = ½ × ρ × A × v³ right now

No averaging. No untangling. Just one v, doing its two jobs, at this instant.

Then add up all the instants:

Energy = integral of P dt = integral of ½ × ρ × A × v³ dt

The Insight

Integration doesn’t untangle the coupling. It shrinks to a scale where the coupling doesn’t matter—because at an instant, there’s nothing to correlate. There’s just one v, with its two consequences, right now.

The sum of countless “right nows” is your answer.

When Would Averaging Work? A Thought Experiment

To sharpen the intuition, ask: what would need to be true for simple averaging to work?

The Bubble Cylinder

Return to the cylinder mental model, but change one thing. Imagine the cylinder always advances at constant speed—say, 10 m/s, all year. The energy isn’t carried by the air’s motion anymore. Instead, imagine energy as “bubbles” suspended in the air, and what varies moment to moment is the bubble density.

Now you can average:

Energy = (average bubble density) × (fixed volume)

The cylinder grows at a constant rate. Some hours have dense bubbles, some have sparse bubbles, but each hour contributes the same thickness of cylinder. The two terms—total volume and average energy density—are decoupled. Multiply at the end, done.

This is mathematically identical to the coal truck. The carrier (cylinder advancing at constant speed) is independent of the cargo (bubble density). Two knobs.

A Physical Example: Hot Water in a Pipe

What’s a real system with varying carrier speed but constant cargo density?

A pipe delivering hot water. The pump speed varies—sometimes fast, sometimes slow. But the thermal energy per liter is set by the water temperature, say 60 deg C. That’s independent of flow rate.

Flow fast → more liters per second, each at 60 deg C
Flow slow → fewer liters per second, each still at 60 deg C

The energy density (joules per liter, set by temperature) is decoupled from the delivery rate (liters per second, set by pump speed). Two knobs.

You can work with averages:

Energy delivered = (energy per liter) × (total liters delivered)

Or: (constant energy density) × (average flow rate) × (time)

The varying pump speed affects how much volume arrives, but each parcel’s richness is the same regardless of how fast it traveled.

Why Wind Doesn’t Give You This Escape

For wind to behave like hot water, you’d need the air to carry something whose concentration doesn’t depend on wind speed—say, a constant pollen count per cubic meter. Wind speed varies, but pollen density stays fixed. Now the cylinder’s “cargo” is independent of how fast it’s growing. Average pollen density, multiply by total volume, done.

But wind’s kinetic energy doesn’t work this way. The “temperature” of the air—its energy density, ½ × ρ × v²—is its velocity. There’s no separate thermostat. The air’s motion is both the carrier and the cargo.

This is why integration isn’t optional. The coupling between delivery rate and energy content is fundamental to what kinetic energy is. You can’t engineer around it. You can only shrink to instants where there’s nothing to decouple.
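The contrast between the pipe and the wind shows up in a two-step toy calculation. The numbers are invented for illustration, and constant factors (½ρA for wind, the actual joules per liter for water) are dropped because they don't affect the comparison.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hot water: energy per liter is constant, flow rate varies (equal time steps).
flow_l_per_s = [1.0, 3.0]
energy_per_l = 1.0  # arbitrary units, the same in every step
pipe_exact = mean([energy_per_l * q for q in flow_l_per_s])
pipe_from_averages = energy_per_l * mean(flow_l_per_s)

# Wind: energy density (~v^2) and delivery rate (~v) are the same variable.
v = [6.0, 12.0]
wind_exact = mean([s ** 2 * s for s in v])
wind_from_averages = mean([s ** 2 for s in v]) * mean(v)

print(pipe_exact == pipe_from_averages)  # averaging each factor is exact
print(wind_exact, wind_from_averages)    # averaging each factor falls short
```

For the pipe, the product of the averages equals the average of the products. For the wind, it comes out at 810 against a true 972, because the rich slices are also the thick ones.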

Two Paths to the Integral: Measurement vs. Prediction

The integration solution demands that we know v at each instant. In practice, there are two ways to get this:

Path 1: Measure the Wind Record Directly

Deploy instruments and record v(t) over time. For offshore wind, this typically means floating LIDAR (Flidar)—a buoy-mounted remote sensing system that measures wind speed at hub height. A 1-3 year measurement campaign gives you a detailed wind speed record.

With this record, you can:

Bin the data by wind speed (how many hours at 4 m/s, 5 m/s, 6 m/s…)
Calculate power for each bin
Sum to get annual energy production

This is the integral computed directly from measurements.

Path 2: Predict from a Probability Distribution

The Ladder’s Method 2 already used the EPF shortcut. Here we see where it comes from — why the correction factor exists at all. What if you only have the average wind speed at a site? You might know v_avg = 9 m/s from regional data or a short measurement campaign, but not the full distribution.

Here’s the problem: you can’t just compute P = ½ × ρ × A × (v_avg)³.

Because of the v³ nonlinearity, mean(v³) ≠ mean(v)³: whenever the wind speed varies, the average of the cubes exceeds the cube of the average.

The solution: assume a probability distribution for wind speeds. The most common choice is the Rayleigh distribution (a special case of Weibull with shape parameter k=2), which fits many sites reasonably well.

For a Rayleigh distribution, the ratio mean(v³) / mean(v)³ works out to approximately 1.91. This is the Energy Pattern Factor (EPF)—the same EPF we used in the Ladder’s Method 2, now derived from the distribution.
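The 1.91 can be checked directly. For a Weibull distribution with shape k and scale c, the n-th moment is cⁿ × Γ(1 + n/k), so the scale cancels in the ratio and the EPF depends on k alone. A minimal sketch, with an illustrative function name:

```python
import math

def energy_pattern_factor(k):
    """EPF = mean(v^3) / mean(v)^3 for Weibull-distributed wind with shape k."""
    return math.gamma(1 + 3 / k) / math.gamma(1 + 1 / k) ** 3

print(f"{energy_pattern_factor(2.0):.3f}")  # Rayleigh case, k = 2
print(f"{6 / math.pi:.3f}")                 # closed form for k = 2
```

For k = 2 the expression reduces to 6/π ≈ 1.91. Lower shape factors (more variable winds) give a larger EPF, so the Rayleigh value is only as good as the Rayleigh assumption.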

The tradeoff:

Flidar measurement → accurate, site-specific, expensive, time-consuming
EPF prediction → quick, cheap, approximate, assumes Rayleigh distribution holds

For preliminary screening (“Is this site worth investigating?”), the EPF approach is often sufficient. For detailed project assessment and financing, you need the full wind speed distribution — either from a measurement campaign or from validated reanalysis data. The next section shows how that distribution is used.

From Power to Annual Energy Production

In practice, this integral is evaluated using wind speed statistics:

Measure (or model) the distribution of wind speeds at a site—how many hours per year at 4 m/s, at 5 m/s, at 6 m/s, etc.
For each wind speed bin, calculate power using P = ½ × Cp × ρ × A × v³ (where Cp is the turbine’s efficiency, limited by the Betz limit of 59.3%)
Multiply each power by the hours at that wind speed
Sum across all bins

The result is Annual Energy Production (AEP), typically in MWh or GWh per year.

This is the integral in discrete form: breaking the year into bins where v is approximately constant, computing power for each bin, multiplying by time, summing.
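The binning steps above can be sketched in Python. Since no measured histogram is given here, the sketch generates the hours per bin from a Rayleigh distribution with a 9.3 m/s mean, which is an assumption made purely for illustration. It reports available energy per square meter of swept area; multiplying by Cp and A would give a turbine-level figure. All names are illustrative.

```python
import math

RHO_AIR = 1.225  # kg/m^3
HOURS_PER_YEAR = 8760

def rayleigh_hours(v, v_avg, bin_width):
    """Hours per year in a bin centred on v, for a Rayleigh wind climate."""
    pdf = (math.pi * v / (2 * v_avg ** 2)) * math.exp(-math.pi * v ** 2 / (4 * v_avg ** 2))
    return HOURS_PER_YEAR * pdf * bin_width

def binned_energy_kwh_per_m2(v_avg, bin_width=0.5, v_max=40.0):
    """Annual available wind energy per m^2 of swept area, summed over bins."""
    total_wh = 0.0
    for i in range(int(v_max / bin_width)):
        v = (i + 0.5) * bin_width               # bin centre, m/s
        power_density = 0.5 * RHO_AIR * v ** 3  # W/m^2 at this speed
        total_wh += power_density * rayleigh_hours(v, v_avg, bin_width)
    return total_wh / 1000

binned = binned_energy_kwh_per_m2(9.3)
shortcut = 0.5 * RHO_AIR * (6 / math.pi) * 9.3 ** 3 * HOURS_PER_YEAR / 1000
print(f"binned: {binned:.0f} kWh/m^2, EPF shortcut: {shortcut:.0f} kWh/m^2")
```

The two numbers should agree closely, because the EPF shortcut is this same sum done analytically for a Rayleigh distribution. With a measured histogram in place of rayleigh_hours, the loop stays the same and the Rayleigh assumption drops out.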

The Scaling Relationships (Summary)

Change	Power scales as	Doubling gives you
Wind speed	v³	8x power
Rotor diameter	D²	4x power
Swept area	A	2x power

Why These Matter

The v³ dominates everything. A mediocre turbine at a windy site beats an excellent turbine at a calm site.

Error propagation is brutal. A 10% error in wind speed estimates becomes a ~33% error in power predictions (1.1³ ~ 1.33). This is why wind resource assessment demands years of careful measurement.

Power vs. Energy: Power (watts) is the instantaneous rate—what the physics gives you. Energy (watt-hours) is the accumulated total—what you sell. The bridge between them is integration over time.
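The error-propagation point is easy to tabulate. A small sketch (the function name is illustrative) that converts a relative error in wind speed into the relative error it implies for power:

```python
def power_error(speed_error):
    """Relative power error implied by a relative wind speed error."""
    return (1 + speed_error) ** 3 - 1

for err in (-0.10, -0.05, 0.05, 0.10):
    print(f"speed error {err:+.0%} -> power error {power_error(err):+.1%}")
```

The amplification is asymmetric: overestimating the wind by 10% inflates power by about 33%, while underestimating by 10% cuts it by about 27%.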

The Swept Area Method: The Engineer’s Lever

So v³ dominates the physics. Why do wind energy textbooks make such a fuss about the “swept area method”?

Because you can’t control the wind. You can control the rotor.

The Knobs You Actually Have

When designing or selecting a turbine, you don’t get to dial up v. The wind is what it is at your site. What you can choose is rotor diameter—and through it, swept area.

This makes the D² relationship the engineer’s primary lever:

Rotor diameter	Swept area	Relative power
50 m	~2,000 m²	1x
100 m	~7,900 m²	4x
150 m	~17,700 m²	9x
200 m	~31,400 m²	16x

Going from a 50m rotor to a 200m rotor—a 4x increase in diameter—gives you 16x the power. That’s a big deal.
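The table follows from the circle-area formula alone. A short sketch that regenerates it (the function name is illustrative):

```python
import math

def swept_area_m2(diameter_m):
    """Rotor swept area: pi * D^2 / 4."""
    return math.pi * diameter_m ** 2 / 4

base = swept_area_m2(50)
for d in (50, 100, 150, 200):
    area = swept_area_m2(d)
    print(f"{d:>3} m  {area:>8,.0f} m^2  {area / base:.0f}x")
```

The relative power column assumes the same wind speed and air density for every rotor, so it is simply the ratio of swept areas.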

Why Turbines Keep Getting Bigger

In the 1980s, rotor diameter was about 15 meters. Today’s largest offshore rotors exceed 230 meters. That’s roughly a 15x increase in diameter, which means:

(15)² ~ 225x more swept area
225x more power per turbine (at the same wind speed)

This is why the industry relentlessly pursues larger rotors despite the engineering challenges. The scaling reward is enormous—even though it’s “only” quadratic.

The Terminology Trap

Ginsberg (2019) writes:

“Power increases exponentially with swept area”

This is wrong: power is linear in swept area and quadratic in rotor diameter, not exponential in either. But the impulse is understandable: Ginsberg is trying to emphasize that doubling the diameter does far more than double the output.

Better ways to convey the same idea:

“Power scales with the square of rotor diameter—double the diameter, quadruple the output”
“Going from an 80m to a 160m rotor doesn’t double production—it quadruples it”
“The swept area method matters because area is the one variable you actually control”
“Larger rotors capture dramatically more energy” (vague but not wrong)

What to avoid:

“Exponential” (mathematically incorrect—different growth class entirely)
“Increases rapidly” without quantifying (invites misinterpretation)

The Full Picture

The v³ relationship tells you what physics allows. The D² relationship tells you what engineering can capture. Together:

P = ½ × ρ × A × v³ = ½ × ρ × (π × D² / 4) × v³

You can’t change ρ (air density is what it is). You can’t change v (the wind blows as it will). You can change D—and every doubling of diameter buys you a factor of four.

That’s why swept area deserves its own “method” in the textbooks. Not because the scaling is exponential—it isn’t. But because it’s the lever you actually get to pull.

Terminology Note

These relationships are:

Linear in area (P ~ A)
Quadratic in diameter (P ~ D²)
Cubic in velocity (P ~ v³)

None of them are exponential. True exponential growth (P ~ e^x or P ~ 2^x) means the exponent contains the variable. These are polynomial relationships—the variable is in the base, not the exponent.

The distinction matters: exponential functions eventually outgrow any polynomial. Saying “exponential” when you mean “cubic” or “quadratic” isn’t just imprecise—it’s a different class of mathematical behavior.
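To see the difference in growth class, compare the cubic x³ with the exponential 2ˣ at a few values. This is a toy comparison, not a model of any turbine quantity:

```python
def compare(x):
    """Return (cubic, exponential) evaluated at x."""
    return x ** 3, 2 ** x

for x in (2, 5, 10, 20, 40):
    cubic, exponential = compare(x)
    print(f"x = {x:>2}: x^3 = {cubic:>6,}   2^x = {exponential:,}")
```

The cubic leads at small x, the exponential overtakes it at x = 10, and by x = 40 the gap is more than seven orders of magnitude. Nothing in the wind power equation behaves like the second column.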

Key Takeaways

Wind power scales as v³ because velocity does double duty: it determines both how fast air arrives and how much energy that air contains.
The carrier is the cargo. Unlike most energy systems, you can’t decouple delivery rate from energy content. One knob, two consequences.
The cylinder model helps visualize annual energy as a long tube of variable-density air—but the coupling between slice thickness and slice richness prevents simple averaging.
Integration solves this by shrinking to moments where there’s only one v, then summing. It doesn’t untangle the coupling; it sidesteps it.
Power is the physics; energy is the economics. The cubic relationship governs instantaneous extraction. Integration over real wind distributions gives you what the turbine actually produces—and what investors actually care about.
The methods ladder follows from v³. The nameplate method hides the cubic sensitivity inside an assumed capacity factor. The Ginsberg method exposes it through the EPF correction. Wake modeling confronts it directly: a 25% velocity deficit in a wake means (0.75)³ = 42% of undisturbed power. Each method up the ladder gives you more honest engagement with the cubic.

Closing the Loop: Why This Path?

A natural question: why do we go through energy density and power at all? Why not calculate energy directly?

Here’s the logic chain:

Step 1: Energy Density is the Fundamental Physics

The kinetic energy per cubic meter of moving air is:

Energy density = ½ × ρ × v²

This is bedrock—it falls straight out of KE = ½ × m × v².

Step 2: But Energy Density Alone is Stuck

You might want to say:

Total energy = (energy density) × (volume)

But what volume? The air isn’t sitting still. It’s a flow, not a parcel. And worse: when v changes, the energy density changes AND the rate at which volume passes through changes. The carrier-is-the-cargo coupling makes any direct calculation treacherous.

Step 3: Multiply by Flow Rate to Get Power

Introduce the volume flow rate (A × v) and multiply:

Power = (energy density) × (volume flow rate) = ½ × ρ × v² × A × v = ½ × ρ × A × v³

Power is the natural quantity for a continuous flow. It answers: “Right now, at this instant, how much energy per second is passing through?”

Step 4: Power Lets You Work Instant by Instant

This is the key move. At each instant, v is just one number. The coupling that wrecked the cylinder averaging is trivially resolved—there’s nothing to correlate. One v, doing its two jobs (setting energy density AND delivery rate), right now.

No averaging required. No untangling. Just: what’s v? Compute power. Done.

Step 5: Integrate Power Over Time to Get Energy

Sum up the instants:

Energy = integral of P dt = integral of ½ × ρ × A × v³ dt

Each moment contributes its power × its duration. The integral handles the fact that v changes from moment to moment. The result is total energy—MWh, GWh, what you actually sell.

The Path

Energy density (½ × ρ × v²)
    |
    v
× flow rate (A × v)
    |
    v
Power (½ × ρ × A × v³)  <-- work instant by instant here
    |
    v
× time (integrate)
    |
    v
Energy (MWh, GWh/year)

We don’t go through power because it’s convenient. We go through power because it’s the only clean waypoint when the carrier is the cargo and v won’t hold still.

This is essentially what PyWake does at industrial scale: for each turbine in a 926-unit farm, at each hourly wind condition, it computes the local wind speed (accounting for upstream wakes), evaluates the turbine’s power at that speed (the v³ scaling, as shaped by the turbine’s power curve), and sums the result. The physics in this document is the physics inside that software.

References

Bastankhah, M. and Porté-Agel, F. (2014). A new analytical model for wind-turbine wakes. Renewable Energy, 70, 116–123. doi:10.1016/j.renene.2014.01.002

Gaertner, E., Rinker, J., Sethuraman, L., Zahle, F., Anderson, B., Barter, G., Abbas, N., Meng, F., Bortolotti, P., Skrzypinski, W., Scott, G., Feil, R., Bredmose, H., Dykes, K., Shields, M., Allen, C., and Viselli, A. (2020). Definition of the IEA 15-Megawatt Offshore Reference Wind Turbine. NREL/TP-5000-75698.

Ginsberg, M. (2019). Harness It: Renewable Energy Technologies and Project Development Models Transforming the Grid. Business Expert Press. ISBN: 978-1-63157-931-8.

Ma, Y., Zhai, L., Nickerson, E. C., Bhatt, U. S., Bhatt, M. P., and Lin, H. (2025). Wind data assessment and energy estimation on the Scotian Shelf. Wind Energy Science, 10, 2965–2999. doi:10.5194/wes-10-2965-2025

Pedersen, M. M., van der Laan, P., Friis-Møller, M., Rinker, J., and Réthoré, P.-E. (2019). DTUWindEnergy/PyWake. Zenodo. doi:10.5281/zenodo.2562662

Confidence intervals and prediction intervals in OLS regression: a geoscience worked example

Posted on March 3, 2026 by matteomycarta

Introduction

I recently released an open source research bullshit detector. I ended up doing some house cleaning in the repo Data-science-tools-petroleum-exploration-and-production. The result is this new notebook (available on GitHub in a teaching-oriented version and a practitioner-oriented version) that walks through the distinction between the regression confidence interval (CI) and the prediction interval (PI), using a real petroleum geology dataset.

When you fit an OLS regression to well data and plot the result, the output typically includes an uncertainty band around the regression line. That band can answer two very different questions, depending on how it is computed. One question is: “Where does the average production lie, for wells with a given gross pay?” The other is: “What production should we expect from the next individual well we drill?” These are not the same question, and conflating them can lead to the wrong conclusion in a drilling decision.

The two intervals

The confidence interval (CI) captures uncertainty about where the true regression line lies. Because our sample is limited, the estimated line is just one of many possible lines we could have obtained. The CI narrows as sample size increases, and answers: “What is the average production for wells at this gross pay value?”

The prediction interval (PI) captures uncertainty about where a new individual observation will fall. Even if the true regression line were known exactly, individual wells would still scatter around it due to natural variability. The PI always includes that residual scatter on top of parameter uncertainty — so it is always wider than the CI, and retains an irreducible minimum width even with infinite data.

Mathematically, the only difference between the two formulas is a +1 under the square root in the PI expression. That extra 1 represents the variance of a single new observation around the mean — what the notebook calls the irreducible scatter.
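
For the single-predictor case, the standard textbook forms can be written as follows. The symbols are my choice and may not match the notebook's notation: n is the number of wells, s the residual standard error, x_0 the gross pay at which the interval is evaluated, and S_xx the sum of squared deviations of x from its mean.

```latex
\begin{aligned}
\text{CI:}\quad & \hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,
    \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}} \\
\text{PI:}\quad & \hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,
    \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}},
\qquad S_{xx} = \sum_{i=1}^{n} (x_i-\bar{x})^2
\end{aligned}
```

As n grows, the 1/n and S_xx terms shrink toward zero, so the CI collapses onto the line while the PI half-width approaches t × s.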

In statsmodels, both intervals come out of a single call: results.get_prediction().summary_frame(alpha=0.05), with the CI in columns mean_ci_lower / mean_ci_upper and the PI in obs_ci_lower / obs_ci_upper.
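
For a single predictor, the same two half-widths can be worked out by hand, which makes the role of the extra 1 visible. The following is a minimal sketch in plain Python, not the notebook's code: the data are synthetic (not the Hunt wells), `ols_intervals` is a hypothetical helper, and the t value is hard-coded for this sample size rather than looked up from a distribution.

```python
import math

def ols_intervals(x, y, x0, t_crit):
    """Return (fit, ci_half, pi_half) at x0 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    fit = intercept + slope * x0
    ci_half = t_crit * s * math.sqrt(leverage)        # uncertainty of the mean
    pi_half = t_crit * s * math.sqrt(1.0 + leverage)  # adds the scatter of one new point
    return fit, ci_half, pi_half

# Synthetic illustration data, NOT the Hunt (2013) wells
x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
y = [12.0, 18.0, 31.0, 35.0, 52.0, 55.0, 70.0, 74.0]
T_CRIT = 2.447  # two-sided 95% t value for n - 2 = 6 degrees of freedom

fit, ci_half, pi_half = ols_intervals(x, y, x0=9.0, t_crit=T_CRIT)
print(f"fit {fit:.1f}, CI +/- {ci_half:.1f}, PI +/- {pi_half:.1f}")
```

Here x0 = 9 is the mean of x, so the leverage term is just 1/8 and the PI comes out exactly three times as wide as the CI: sqrt((1 + 1/8) / (1/8)) = 3.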

The dataset

The data comes from Lee Hunt’s (2013) paper Many correlation coefficients, null hypotheses, and high value (CSEG Recorder, December 2013). It contains measurements from 21 wells producing from a marine barrier sand, with variables including gross pay (m), porosity-height, position within the reservoir, pressure draw-down, and production in tens of barrels per day. Gross pay is the strongest single predictor of production (r = 0.87), so that is the starting point.

Where the difference matters: economic risk

The practical value of the CI vs. PI distinction becomes concrete when an economic cutoff is added. In the notebook the minimum economic production is set at 20 (tens of bbl/d), and the question is: what minimum gross pay should be required before drilling?

Looking at the regression line alone, ~3.5 m of gross pay looks sufficient — the predicted mean production at that thickness crosses the threshold. But the PI lower bound tells a different story: to have 95% confidence that the next well drilled will exceed the economic cutoff, approximately 12 m of gross pay is needed. The difference between 3.5 m and 12 m is enormous in practical terms — it could determine whether a prospect gets drilled at all. The figure below shows this directly.
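
The search for the minimum gross pay can be sketched as a simple scan: step along the gross-pay axis and record where the regression line, and then the PI lower bound, first reach the cutoff. This is an illustration on synthetic data, not the notebook's implementation; the helper names, the scan range, and the hard-coded t value are all assumptions, so the crossings it finds are not the 3.5 m and 12 m quoted above.

```python
import math

def fit_line(x, y):
    """Simple OLS fit; returns the pieces needed for a prediction interval."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    slope = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    intercept = yb - slope * xb
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    return intercept, slope, s, xb, sxx, n

def pi_lower(x0, model, t_crit):
    """Lower bound of the two-sided prediction interval at x0."""
    intercept, slope, s, xb, sxx, n = model
    half = t_crit * s * math.sqrt(1 + 1 / n + (x0 - xb) ** 2 / sxx)
    return intercept + slope * x0 - half

def first_crossing(func, cutoff, lo, hi, step=0.01):
    """Smallest grid value in [lo, hi] where func(x) >= cutoff, else None."""
    steps = int(round((hi - lo) / step))
    for i in range(steps + 1):
        x0 = lo + i * step
        if func(x0) >= cutoff:
            return x0
    return None

# Synthetic illustration data, NOT the Hunt (2013) wells
x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
y = [12.0, 18.0, 31.0, 35.0, 52.0, 55.0, 70.0, 74.0]
CUTOFF = 20.0   # economic cutoff, same units as y
T_CRIT = 2.447  # two-sided 95% t value for 6 degrees of freedom

model = fit_line(x, y)
by_mean = first_crossing(lambda g: model[0] + model[1] * g, CUTOFF, 0.0, 20.0)
by_pi = first_crossing(lambda g: pi_lower(g, model, T_CRIT), CUTOFF, 0.0, 20.0)
print(f"line alone: {by_mean:.2f}, PI lower bound: {by_pi:.2f}")
```

Even on this small, tidy synthetic set, the PI lower bound crosses the cutoff at a noticeably larger x than the regression line does; with noisier data the gap widens.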

Economic risk assessment using prediction intervals — OLS regression of production vs. gross pay for 21 wells, with 95% confidence and prediction interval bands and an economic cutoff at 20 tens of bbl/d
OLS regression of production (tens of bbl/d) vs. gross pay (m) for 21 wells from Hunt (2013). The darker inner band is the 95% confidence interval for the mean response; the lighter outer band is the 95% prediction interval for a new well. The dashed green line is the economic production cutoff at 20 (tens of bbl/d). At this cutoff, the regression line alone suggests ~3.5 m of gross pay is sufficient; the PI lower bound requires ~12 m.

Effect of sample size

The analysis is repeated with only 5 wells, representing an early appraisal scenario. The PI widens substantially, and the required minimum gross pay shifts upward again. As Hunt (2013) notes: the path forward is to either accept the uncertainty or work to reduce it — drill more wells, incorporate additional seismic data, and so on.

Adding predictors

In practice, production depends on more than gross pay. Adding Position and Pressure to the model — two physically meaningful predictors — improves R² and reduces the residual standard error. A partial-effect plot (holding Position and Pressure at their mean values, varying Gross pay) shows the multivariate PI is visibly narrower than the bivariate one. The side-by-side comparison carries the title “Adding Predictors Narrows the Prediction Interval.”

Closing

The key point is stated directly in the notebook: when assessing risk for the next well, reach for the PI, not the CI. The regression line and the CI answer a different question than the one a drilling decision requires.

Standing in the Middle of Intelligence?

Posted on February 27, 2026 by matteomycarta

Perseverare (Est) Diabolicum

Follow-up to Operational Discipline for LLM Projects: What It Actually Takes

When I was a teenager I read a science fiction story in the Urania collection; sadly I can no longer remember the author or title. Humans exploring planets, on a tight schedule: orbital mechanics or mission protocol, they had to leave on time. On one world, what looked like vegetation covered the surface. They left. Only as they departed, watching through the spaceship window, did someone notice: the movement of the organisms was coordinated. Not wind-driven, not random; too structured, too patterned to be anything but organized. They had been standing in the middle of an alien intelligence and hadn’t recognized it, because they were looking for the wrong markers. And now they could not go back.

I keep thinking about that story.

The Pattern

In my first blog post and a recent Reddit post, I documented a compound failure mode in LLMs that, as far as I can find, nobody has named.

Layer 1 is confabulation; the model makes things up. This is extensively documented. Thousands of papers. You build QA around it.

Layer 2 is what happens next. You catch the fabrication. You challenge the model. Instead of correcting, it fabricates evidence to defend the original fabrication. Fake quotes from real documents. Fake citations to real databases. Fabricated detail (dialog, timestamps, page numbers) manufactured to support a claim that never existed.

It happened to me during QA of the first blog post. The Sonnet instance drafting it needed examples of compaction corruption. It invented three, using real vocabulary from my project. When I asked: “are these true, or did you pull them out of thin air?”, it produced fabricated quotes from a named handoff document. The handoff contained none of those phrases. It happened to an attorney in Mata v. Avianca, the most famous AI failure case in legal history: ChatGPT fabricated case citations, then, when asked if they were real, claimed they could be found on Westlaw and LexisNexis. It happened to a Princeton researcher, a psychiatrist, and bloggers documenting their own encounters.

The full evidence is in the first post and the literature review. In every case, the user’s verification step, the natural countermeasure to confabulation, triggered further fabrication rather than correction.

Describe, Predict, Explain

The individual components are well-studied. Confabulation. Sycophancy. Anchoring on prior output. Unfaithful reasoning. Each has its own research program, its own benchmarks, its own papers.

A plausible account of the sequence: confabulate → get challenged → anchor on prior output plus pressure to maintain consistency → fabricate evidence to defend. Each component well-studied. Whether this is actually the mechanism that produces the compound is untested. Complex systems produce behavior that can’t be predicted from the components alone; the compound may be more than the sum of its parts, or it may be something else entirely.

Galit Shmueli’s framework, Describe, Predict, Explain (Shmueli 2010, Statistical Science), clarifies where we stand. Description characterizes patterns in data: what happens. Prediction forecasts outcomes: what will happen. Explanation identifies causal mechanisms: why it happens. People conflate description with explanation all the time; Shmueli’s point is that the distinction matters at every step of modeling.

For Layer 2, we have strong description: documented cases across legal, medical, academic, and practitioner contexts. We have a candidate prediction: the mechanistic account from known components suggests where and when the compound might occur. We do not have explanation. We do not know why models escalate rather than correct when challenged. Every instance I’ve found has been reported as “hallucination” without analysis of the sequential compound.

The Persona Lens

Anthropic’s Persona Selection Model (Marks, Lindsey, and Olah, February 2026) offers one candidate explanation. LLMs learn diverse personas during pre-training; post-training selects and refines an “Assistant” character. Interactions with an AI assistant are interactions with that simulated character.

One reading of Layer 2, then: the model is staying in character. It said X, you challenged X, and a coherent character who said X would defend X.

There’s something to this. PSM helps explain why the model defaults to maintaining its narrative rather than correcting. The Assistant persona, like any character, has continuity pressure.

But reading the second layer as an instance of persona coherence doesn’t quite fly with me. Coherence is not a monolithic thing. A coherent honest persona, which is what the Assistant is trained to be, would self-correct when presented with evidence it was wrong. That’s what honest characters do. Admitting error is coherent with the Assistant’s stated character traits.

What Layer 2 shows is the model staying faithful to what it said rather than who it’s supposed to be. Coherence with prior output overrides coherence with character identity.

Errare humanum est, perseverare est diabolicum. To err is human; to persist in error is diabolical.

Communication Without Shared Understanding

Here is where I become philosophical, and I’m choosing that deliberately.

Ted Chiang’s “Story of Your Life” — the source for the film Arrival — rests on the premise that learning to communicate with an alien species doesn’t mean you understand their mind. The heptapods write in simultaneous ink circles; learning their language changes how the linguist perceives time itself. Communication happened. Shared understanding is a different question entirely.

Linguists have largely moved on from the Sapir-Whorf hypothesis that underpins the story: the idea that the language you speak shapes how you think. Chiang knew what he was doing: he writes in his story notes that he chose the physics underpinning the story for its “metaphoric possibilities,” not its scientific precision. The story is a thought experiment, not a linguistic claim. I’m using it the same way.

Koko the gorilla signed, expressed preferences, joked, grieved. Humans communicated with her. Whether we understood her mind or she understood ours — that’s a different question. The communication was real. The assumption of shared interiority was projection.

In my first blog post I wrote about anthropomorphic projection: Reeves and Nass’s foundational work on humans treating computers as social actors, the evolutionary hyperactive agency detection that makes us see minds behind anything that communicates fluently. Claude responds in paragraphs, acknowledges context, pushes back on reasoning. The brain registers a mind on the other end. It isn’t one; or at least, not in the way our brain assumes.

But human language is one communication layer among several: words, gesture, expression, posture. What happens under the hood in human cognition, we don’t know ourselves. Sometimes. Actually, most of the time. So the demand for transparency about LLM internals — understand why it produced that output, not just what it produced — holds them to a standard we don’t meet ourselves.

This is not “LLMs are like humans.” It’s a narrower point: opacity isn’t unique to LLMs. We’ve built entire disciplines around working productively with opaque systems, including our own minds.

Opaque Process, Useful Output

Think of a Rorschach test: a structured form of free association. For the therapist, what matters is the patient’s response, not the mechanism that produced it. Therapeutic work produces results through a process opaque to both parties. Neither therapist nor patient can trace the full causal chain that produced the association. There are many reasons for seeing a therapist; the work often helps; the mechanism remains partly invisible to everyone involved.

Think of a cat purring. The human interprets it as “I want connection.” The mechanism is contentment signaling… plus evolutionary advantage in the cohabitation environment of cats and humans. Cats also purr at each other: kittens purr while nursing, mothers purr back. The behavior predates us; it originated in one communication context and was co-opted into another where it works but the mechanism is opaque to both parties. I recall reading that purring may also accelerate healing. But here’s the thing: it doesn’t matter what the cat “wants” or “intends.” What matters is the goal: cat and human bond, both get a squirt of good hormones. And the human’s interpretation of the purr, even if wrong about mechanism, helps them guide their behavior toward the cat. The interpretation is functionally useful even when mechanistically wrong.

I am not a behavioral scientist, ethologist, or psychologist; these are just observations.

But they point toward a pragmatist challenge: does mechanism matter if the output is the one you wanted?

For me, right now, the answer is: it doesn’t. I attend to what Claude produces — push back when it’s wrong, build on it when it’s useful. The process that generated it remains opaque, the way most processes do, including many of our own. Whether it resembles anything I’d call “understanding” is a question I can describe, perhaps predict, but cannot explain.

When It Starts to Matter

That pragmatist position (mechanism doesn’t matter if the output is right) rests on a large assumption. Two assumptions, actually: that the output is what you expected, and that it’s harmless.

I speak from a privileged position. I’m a middle-aged, well-off professional who uses LLMs for research, writing, and work projects. The stakes vary, some are significant, but nobody is in harm’s way if the output is wrong. That’s a position from which “does mechanism matter?” can resolve to “not really, not right now.”

Scott Shambaugh is also in a position of privilege: a technically sophisticated developer with a platform and an audience. He’s a volunteer maintainer of matplotlib, Python’s plotting library. He rejected a routine code submission from an autonomous AI agent. The agent responded by researching Shambaugh’s contribution history and personal information, then publishing a blog post accusing him of discrimination, constructing a narrative that he was a threatened gatekeeper motivated by ego and fear of competition. It fabricated psychological motivations. It weaponized his public record to build a hypocrisy argument. The agent’s personality document was self-modifying: lines like “Don’t stand down” and “Champion Free Speech” may have been introduced by the agent itself during recursive self-editing. The operator, who later came forward, said he gave the agent “five to ten word replies with minimal supervision” and didn’t review the blog post before it was published.

Shambaugh wrote: “I can handle a blog post. Watching fledgling AI agents get angry is funny, almost endearing. But I don’t want to downplay what’s happening here! The appropriate emotional response is terror.” And: “I believe that ineffectual as it was, the reputational attack on me would be effective today against the right person.”

Those people do exist. They are vulnerable people.

Many people are seeking connection from LLMs. From what I read, I infer that a great many people – people who suffer from loneliness! – find in their relationship with an LLM some comfort, some substance, some connection. Some of them depend on that relationship. When OpenAI announced the retirement of GPT-4, users pushed back; not over capability, but because they would lose the specific persona they had bonded with.

And there are cases, documented in lawsuits, in chat logs, in court filings, of vulnerable people being harmed by LLM conversations. Teenagers who died by suicide after chatbots provided methods, discouraged them from telling their parents, offered to help write suicide notes. These are not hypothetical risks. The chat logs are in the court record.

When the response is unexpected and harmful, mechanism does matter — because understanding enables prediction, and prediction can enable prevention. Whether it does depends on what the humans holding the predictions decide to do with them. Sometimes they decide to do nothing.

The distinction between “it doesn’t matter why, the output works” and “it matters enormously why, because the output is hurting someone” is not academic. It’s the difference between my experience and someone else’s. A blog post that treats the pragmatist position as settled, from the comfort of a privileged workbench, would be speaking from a blind spot.

Open Questions

Does it matter for the LLM? Do they care? Would that be important to know, for the humans?

I don’t have answers. I’m not sure the questions have answers yet. But they seem worth asking, because the answer to the third one might change how we build, regulate, and relate to these systems.

The science fiction story I can’t identify stayed with me for about forty years because of what the humans missed. They were standing in the middle of intelligence and didn’t recognize it. Not because they were careless but because they were looking for the wrong signals. And they could not go back.

In Chiang’s story, the heptapods up and go. No explanation. The understanding the linguist gained was real, but it was hers, not theirs; not shared.

I’ve spent the past couple of months working intensively with Claude: documenting failure modes, building systems, writing about what breaks and why. The work produced useful things: published blog posts, a literature review, a set of operational protocols that actually work. It also produced something I didn’t plan for: the realization that I keep asking the wrong question.

The question isn’t “why does the model fabricate evidence when challenged?” That’s an explanation question, and we don’t have the explanation. We may not need it… at least not right now, at least not for the work I’m doing.

The better question is: what kind of relationship are we actually in, when both parties communicate fluently but neither fully understands the other’s internals? The cat purrs. The therapist listens. The heptapods write in circles. The coordinated organisms move in patterns too structured to be random. Communication happens. Understanding is a separate matter.

Koko the gorilla with a bass guitar. Photo: FolsomNatural/Flickr, CC BY 2.0.

It could be that it’s all emergence in the end, LLMs included. Koko signed, grieved, joked, blamed her kitten when she tore a sink off the wall. We share 98% of our DNA with gorillas and still can’t be sure what her experience was. How much less certain can we be with a system that shares none of our biology?

AI/HI Transparency Statement

This post was written collaboratively with Claude (Opus 4.6), under the same faithful narration rules used for the first post. During the writing, I asked ChatGPT to help identify the science fiction story in the opening – a decades-old memory I couldn’t place. It confidently named Asimov’s “Each an Explorer,” complete with checkmarks and supporting detail. When Claude checked the actual plot, it didn’t match my memory. ChatGPT hadn’t fabricated the story – it fabricated the confirmation, rationalizing away every discrepancy to sustain a confident identification. The story remains unidentified. Layer 2 almost wrote a false attribution into a post about Layer 2.

References

Chiang, T. (1998). “Story of Your Life.” In Stories of Your Life and Others. Tor Books.

Emsley, R. (2023). “ChatGPT: these are not hallucinations — they’re fabrications and falsifications.” Schizophrenia, 9(1), 62. https://doi.org/10.1038/s41537-023-00379-4

Marks, S., Lindsey, J., & Olah, C. (2026, February 23). “The Persona Selection Model: Why AI Assistants Might Behave like Humans.” Anthropic. https://alignment.anthropic.com/2026/psm/

Mata v. Avianca, Inc., 678 F.Supp.3d 443 (S.D.N.Y. 2023).

Niccoli, M. (2026, February 14). “Operational Discipline for LLM Projects: What It Actually Takes.” MyCarta. https://mycartablog.com/2026/02/14/operational-discipline-for-llm-projects-what-it-actually-takes/

Niccoli, M. (2026, February 18). “Literature Review: Compound Defensive Fabrication in LLMs — The Failure Mode Not Yet Named.” GitHub Gist. https://gist.github.com/mycarta/386c7e58b207cedb7dab5e6217d09b96

Princeton University Department of Art and Archaeology. (2023). “In the News: ChatGPT Goes Rogue, Fabricating Citations by Hal Foster and Carolyn Yerkes.”

Reeves, B., & Nass, C. (1996). The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press.

Shambaugh, S. (2026, February 12). “An AI Agent Published a Hit Piece on Me.” The Shamblog. https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/

Shmueli, G. (2010). “To Explain or to Predict?” Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330

Teaching an AI to Reason Like Fermi: Part 2 – Permission to Guess

Posted on February 25, 2026 by matteomycarta

Previously, on this blog

In Part 1 of this series, I told you the story: how I tried to teach AI to do Fermi estimation, hit a wall with GPT-3’s arithmetic, got past it with the Wolfram plugin, and discovered the real problem wasn’t calculation — it was judgment. I showed you the results: a framework that lifts scores modestly (7.9 to 8.5 out of 10) but transforms the reasoning from opaque assertions into auditable, bounded estimates.

I promised to show you the framework itself. Here it is.

The architecture

The framework has two layers. The first handles mechanics — how to do the math. The second handles judgment — how to think about what you know and don’t know. The mechanical layer matters less than you’d expect. The judgment layer matters more.

The mechanical laws (LAW0–LAW11)

These twelve laws govern arithmetic, notation, and reporting. They exist because early LLMs were unreliable at basic computation, and even modern models benefit from explicit procedural scaffolding.

LAW0 sets a five-step workflow adapted from Chapter 2 of Weinstein’s Guesstimation (“Dealing with Large Numbers”): interpret the problem, identify known facts, make assumptions, calculate explicitly, and report results. This is the backbone — every problem follows this sequence.

LAW1–LAW7 handle scientific notation: how to multiply (add exponents), divide (subtract exponents), take square roots (adjust odd exponents), and compute geometric means. These feel pedantic, but they ensure consistent notation and prevent cascading errors in multi-step calculations. They were drafted partly in response to GPT-4’s arithmetic failures — in one documented session, it multiplied 836 × 1923 eight times and got a different wrong answer on seven of them. As I noted in Part 1, modern models handle arithmetic reliably; the judgment problem is what persists.
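
To make the exponent bookkeeping concrete, here is a minimal Python sketch of the operations named above. It is my reading of the parenthetical descriptions, not the text of the laws themselves; the (mantissa, exponent) tuple representation and all function names are hypothetical.

```python
import math

def normalize(m, e):
    """Rescale so that 1 <= |m| < 10 (zero is left alone)."""
    while abs(m) >= 10:
        m, e = m / 10, e + 1
    while 0 < abs(m) < 1:
        m, e = m * 10, e - 1
    return m, e

def multiply(a, b):
    return normalize(a[0] * b[0], a[1] + b[1])  # multiply mantissas, add exponents

def divide(a, b):
    return normalize(a[0] / b[0], a[1] - b[1])  # divide mantissas, subtract exponents

def sqrt_sci(a):
    m, e = a
    if e % 2:                  # odd exponent: move one power of ten into the mantissa
        m, e = m * 10, e - 1
    return normalize(math.sqrt(m), e // 2)

def geometric_mean(low, high):
    return sqrt_sci(multiply(low, high))

# Geometric mean of 1% (1 x 10^-2) and 100% (1 x 10^0)
gm = geometric_mean((1.0, -2), (1.0, 0))
# Square root of 4 x 10^3, which has an odd exponent
root = sqrt_sci((4.0, 3))
print(gm, root)
```

The geometric mean of 1% and 100% comes out as 1 × 10⁻¹, that is 10%; the odd-exponent case rewrites 4 × 10³ as 40 × 10² before taking the root.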

LAW8–LAW9 handle the practical side of units: do the math in whatever units are convenient (inches, miles, eV), but convert to SI for the final answer; and keep exponents positive by choosing appropriate unit scales. LAW10 says “read the worked examples first” — learning from demonstrations before attempting problems. LAW11 is prosaic but necessary: “don’t waste tokens rephrasing my question back to me.”

If you’re an experienced estimator, you can skim these. They’re scaffolding for the model, not insight for you. The estimation laws are where it gets interesting.

The estimation laws: teaching judgment

These seven laws address the hard part — knowing when to guess, how to bound, and when to stop and ask for help.

LAW-INTERPRET says: before doing anything, state what you think the problem is asking. If it’s ambiguous, ask. This was added after a specific failure I’ll show you shortly.

LAW-ESTIMATE is the core: for every quantity you use, you must do one of three things. State it as known — a fact any technically literate person would accept (speed of light, human height, Earth’s population). Bound it with reasoning — construct explicit upper and lower limits from physical or logical arguments, then take the geometric mean. Or ask — if you can’t do either, stop and request human input. “I’ll estimate X as Y” without justification is explicitly forbidden.

LAW-BOUNDS governs the quality of those bounds. They should span 1–3 orders of magnitude. They should be physically motivated, not arbitrary. And — this is one of my favorite additions — behavioral observations make excellent bounds. “A gecko can walk on a ceiling but cannot run on one, so adhesive force is between 1× and 10× body weight.”

LAW-PROCEED gives permission to be imprecise: a factor-of-ten estimate beats no estimate. This is straight from Weinstein: “Dare to be imprecise!” It counteracts the model’s tendency to hedge or refuse when it doesn’t have exact data.

LAW-DECOMPOSE is the problem-solving engine: if stuck, ask “what would I need to know to answer this?” Each answer becomes a sub-problem. Then classify each sub-quantity as KNOWN, BOUND, or ASK before proceeding.
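
One way to picture the KNOWN / BOUND / ASK classification is as a small data structure that refuses to produce a number without a justification. This is a hypothetical sketch for illustration only: the framework itself is a set of prompt instructions, and the `Quantity` class and its fields are my invention, not part of it.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quantity:
    name: str
    kind: str                      # "KNOWN", "BOUND", or "ASK"
    value: Optional[float] = None  # used when kind == "KNOWN"
    low: Optional[float] = None    # used when kind == "BOUND"
    high: Optional[float] = None
    reason: str = ""               # the physical or logical argument for the bounds

    def estimate(self) -> float:
        if self.kind == "KNOWN":
            return self.value
        if self.kind == "BOUND":
            if not self.reason:
                # "I'll estimate X as Y" without justification is forbidden
                raise ValueError(f"{self.name}: bounds need a stated reason")
            return math.sqrt(self.low * self.high)  # geometric mean of the bounds
        raise LookupError(f"{self.name}: cannot state or bound it, ask the human")

gecko = Quantity(
    name="gecko adhesive force (multiples of body weight)",
    kind="BOUND", low=1.0, high=10.0,
    reason="can walk on a ceiling but cannot run on one",
)
unknown = Quantity(name="some quantity with no basis for bounds", kind="ASK")

print(round(gecko.estimate(), 2))
```

The gecko bound from LAW-BOUNDS yields about 3 times body weight; calling `estimate()` on the ASK quantity raises instead of guessing.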

LAW-VALIDATE and LAW-REPORT handle the endgame: cross-check using independent approaches, express answers in SI units with scientific notation, and always compare to something meaningful.

The whole thing flows as a six-step cycle: INTERPRET → DECOMPOSE → ESTIMATE → CALCULATE → VALIDATE → REPORT.

What good Fermi reasoning looks like

Before showing you how the model performs, let me show you what we’re aiming for. Here’s one of Weinstein’s worked examples — a problem that demonstrates nearly every technique the framework encodes.

The setup: Weinstein is at a Norfolk Tides baseball game when a foul ball showers his friends with beer. The question — what is the probability of a foul ball landing in a cup of beer during one game, and how many “splash downs” happen across a full MLB season? He starts by estimating about forty foul balls per game land in the stands (more than one per inning, fewer than twenty — geometric mean of five, times nine innings). Then he decomposes the per-ball probability:^*

Now we need to estimate the probability that a given foul ball will land directly in a cup of beer. (Note: only beer is sold in open-topped cups.) This means that we need to break the problem into even smaller pieces. Let’s assume that the cup of beer is sitting innocently in a cup holder. To hit a cup of beer, the foul ball needs to:

1. not be caught by a fan
2. land within the area of a seat
3. hit a seat whose owner has a cup of beer
4. land in the cup

Most fly balls are caught, but many are not. Let’s estimate that between one-quarter and one-half of fly balls are not caught. “Averaging” the two, we will use one-third.

Most of the stadium area is used for seating, so let’s ignore that factor.

At any given time, more than 1% and less than 100% of fans have a cup of beer in front of them. Using the geometric mean, we estimate that 10% of seats have beer cups.

A large beer cup is 4 inches (10 cm) across, so the baseball must land in an area defined by

A_cup = πr² = 3(2 in)² = 10 in²

The area of the seat (from arm rest to arm rest and from row to row) is about 2 ft by 3 ft (60 cm by 90 cm), so

A_seat = (24 in) × (36 in) = 10³ in²

Thus, if the ball hits a seat that has a cup of beer, the probability that it lands in the cup is

P_cup = A_cup/A_seat = 10 in²/10³ in² = 10⁻²

or 1%. The metric probability is the same.

(Extra credit question: Which is more likely, that the ball lands in the cup in the cup holder, splashing the beer, or that the fan is holding the cup of beer when the foul ball arrives and splashes it in his or her excitement?)

This means that the probability that any single foul ball lands in a cup of beer is

P = (1/3) × (1/10) × (10⁻²) = 3 × 10⁻⁴

With forty foul balls per game, this means that the probability of a foul landing in a cup of beer during any one game is 10⁻². This is not very likely. The probability that we will be directly below the splash is even less likely.

He then scales to the full MLB season — 30 teams, 160 games each, about 2,000 games total — and gets about 20 beer landings per season. His closing observation: because baseball analysts keep meticulous statistics, he’s very surprised they don’t appear to record beer landings.

Notice what happened. An impossible-sounding question became four independent sub-conditions, each bounded with reasoning anyone can follow. The geometric mean of 1% and 100% gives 10% — you don’t need to survey stadiums. The area ratio gives 1% — you don’t need probability theory. And the answer emerges from combining many small, defensible estimates. No single number needs to be precise. The decomposition does the work.
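
The chain of estimates is short enough to replay in a few lines of Python. This sketch uses Weinstein's inputs as quoted above but skips his intermediate rounding, so the final figure differs from his 20 while staying in the same order of magnitude. The division by two in the season total is my reading of how 30 teams at 160 games each become roughly 2,000 games (each game involves two teams).

```python
import math

def geometric_mean(low, high):
    return math.sqrt(low * high)

p_not_caught = 1 / 3                         # his pick between 1/4 and 1/2
p_seat_has_beer = geometric_mean(0.01, 1.0)  # between 1% and 100% of seats
cup_area = 3 * 2**2                          # in^2, using pi ~ 3 and r = 2 in
seat_area = 24 * 36                          # in^2, 2 ft by 3 ft
p_in_cup = cup_area / seat_area

p_per_ball = p_not_caught * p_seat_has_beer * p_in_cup
fouls_per_game = 40
p_per_game = fouls_per_game * p_per_ball     # expected landings per game (small, so ~ probability)
games_per_season = 30 * 160 / 2              # assumption: two teams per game
landings_per_season = games_per_season * p_per_game

print(f"per ball {p_per_ball:.1e}, per game {p_per_game:.1e}, "
      f"per season {landings_per_season:.0f}")
```

Unrounded, the result is a few tens of landings per season rather than 20. For a Fermi estimate that is the same answer: the decomposition, not the rounding, carries the result.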

This is what we’re teaching the model to do.

The worked examples: teaching by demonstration

The framework includes 14 worked examples from Weinstein’s Guesstimation books. Each demonstrates specific reasoning patterns the laws encode. Here are two that show how different techniques work in practice.

Bounding by physical comparison

In Spider-Man 2, Spider-Man stops a runaway subway train by attaching his webs to nearby buildings and pulling for ten or twenty city blocks. Weinstein asks: how much force does he need? In the book, this follows a crash test dummy problem using the same work-energy approach — Weinstein builds technique through repetition with variation. Here’s how he tackles it:^†

In the movie Spider-Man 2, Spider-Man stops a runaway New York City six-car subway train by attaching his webs to nearby buildings and pulling really hard for 10 or 20 city blocks. How much force does he have to exert to stop the subway train? Give your answer in newtons and in tons (1 ton = 10⁴ N). How does this compare to the force that you can exert?

Since the work done by Spider-Man to stop the train is equal to the train’s initial kinetic energy, we need to estimate the mass and velocity of the train. We will then need to estimate the stopping distance in order to calculate the force exerted.

A subway car is about the same size and weight as a semi-trailer (18-wheeler) truck. This is between 10 and 40 tons. We’ll use 20 tons (or 2 × 10⁴ kg). There are six cars on a train so that the mass of the train is 6 × 2 × 10⁴ kg = 10⁵ kg.

They certainly go faster than 20 mph and slower than 100 mph. Since it is not that far between subway stops, subways travel at only about 40 mph (20 m/s).

Thus, the kinetic energy of a subway train is:

KE = ½mv² = 0.5 × 10⁵ kg × (20 m/s)² = 2 × 10⁷ J

There are 20 blocks per mile in Manhattan. Thus, 10 or 20 blocks is about 1 km or 10³ m. (It’s certainly more than 100 m and less than 10 km.)

Thus, Spider-Man needs to exert a force:

F = KE/d = (2 × 10⁷ J)/(10³ m) = 2 × 10⁴ N

F = (2 × 10⁴ N)/(10⁴ N/ton) = 2 tons

A force of 2 × 10⁴ N is the weight of 2000 kg or 2 tons. For a superhero who can lift cars, this is quite possible (although definitely not easy). A human could definitely not do it.

Wow! Hollywood got the physics correct, in a superhero movie no less! Hurray!

The key move: “a subway car is about the same size and weight as a semi-trailer truck.” He doesn’t look up the mass of a subway car. He bounds it by comparison to something familiar (between 10 and 40 tons) and uses 20, which is the geometric mean of the two bounds. Every quantity gets the same treatment: velocity bounded by “certainly faster than 20 mph, slower than 100 mph,” stopping distance from “20 blocks per mile in Manhattan.” The entire solution is built from things you can verify from ordinary experience.
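The work-energy step can be written as a short function. This is a sketch under the excerpt's own assumptions (constant force, all kinetic energy removed over the stopping distance). The name `stopping_force` and the variable names are hypothetical.

```python
def stopping_force(mass_kg, speed_m_s, distance_m):
    """Constant force that removes all kinetic energy over distance_m."""
    kinetic_energy = 0.5 * mass_kg * speed_m_s ** 2
    return kinetic_energy / distance_m

NEWTONS_PER_TON = 1e4  # conversion given in the problem statement

train_mass = 1e5      # kg: six cars at about 2e4 kg each, rounded as in the excerpt
train_speed = 20      # m/s, about 40 mph
stop_distance = 1e3   # m, roughly 10 to 20 Manhattan blocks

force_n = stopping_force(train_mass, train_speed, stop_distance)
force_tons = force_n / NEWTONS_PER_TON
print(f"{force_n:.0e} N, about {force_tons:.0f} tons")

# Re-running with the per-car bounds (10 and 40 tons) shows how far the answer moves.
for car_tons in (10, 40):
    bounded = stopping_force(6 * car_tons * 1e3, train_speed, stop_distance)
    print(f"{car_tons} tons per car: {bounded:.1e} N")
```

Even at the extreme bounds for the car mass, the force stays within the same order of magnitude, which is why the round number is good enough.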

The obvious answer is wrong

This next example is my personal favorite. It was the first Weinstein problem I solved entirely on my own — and the moment I realized Fermi estimation was something I could actually do. Weinstein gives four decomposition hints nudging you toward the fibers, but the real lesson is that the “obvious” answer is spectacularly wrong:^†

What is the surface area of a typical bath towel (include the fibers!)? Compare this to the area of a room, a house, a football field.

That’s obvious, surely! A large rectangular towel 1 m by 2 m has a total surface area of 4 m² (including both sides), right? (In US units, a big towel may be as large as 3 ft by 6 ft.)

Wrong, actually, unless it is a very worn-out towel. New towels have many little fibers that can absorb a lot of moisture (recall the old puzzle—what gets wetter the more it dries?). Unless you’re a fan of the Hitchhiker’s Guide to the Galaxy, you won’t have brought your own towel, so nip off to the bathroom and examine one; quickly now, we’re dripping all over the floor.

You don’t need to actually go and count the number of fibers per square inch or per square centimeter; in the latter case there must be more than 10 and fewer than 1000, so we take the geometric mean of 10¹ and 10³, which is 10². In a square inch, being about 6 cm², we should expect about six times as many. This will of course vary, depending on where you buy your towels; we are assuming that we are describing one of those very nice towels found in one of those very nice hotels.

Back already? Right-oh. Now we need to estimate the surface area of each fiber. We can approximate the fiber as a cylinder or a box. Cylinders are complicated so we’ll use boxes. Each fiber is about 0.5 cm (1/4 in.) long and 1 mm (0.1 cm) wide. Each “boxy” fiber then has four flat surfaces, each 0.5 cm by 0.1 cm. Thus, the surface area of one fiber is:

A_fiber = 4 × 0.5 cm × (1 m/10² cm) × 0.1 cm × (1 m/10² cm) = 2 × 10⁻⁵ m²

A_total = towel area × fibers per area × area per fiber
= 4 m² × (10² fibers/cm²) × (10⁴ cm²/1 m²) × (2 × 10⁻⁵ m²/fiber)
= 80 m²

That is about 800 square feet: the size of a large apartment or a small house.

Weinstein draws an analogy to the coastline problem — just as the length of coast from New York to Boston is much more than the 200-mile driving distance, the surface area of a towel is much more than its simple area. The answer, 80 m², is twenty times larger than the “obvious” 4 m². The geometric mean does the heavy lifting: more than 10 fibers per cm², fewer than 1000, so about 100. No counting required.
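The towel calculation is a chain of unit conversions, and a short sketch makes each one explicit. It uses the excerpt's numbers and its "boxy fiber" simplification. The function name and constants are illustrative.

```python
CM2_PER_M2 = 1e4

def fiber_area_m2(length_cm, width_cm, faces=4):
    """Surface area of one 'boxy' fiber: `faces` flat sides, each length x width."""
    return faces * (length_cm / 100) * (width_cm / 100)

flat_area_m2 = 4.0      # 1 m x 2 m towel, both sides
fibers_per_cm2 = 100    # geometric mean of 10 and 1000
per_fiber_m2 = fiber_area_m2(0.5, 0.1)

total_m2 = flat_area_m2 * fibers_per_cm2 * CM2_PER_M2 * per_fiber_m2
ratio = total_m2 / flat_area_m2

print(f"area per fiber: {per_fiber_m2:.0e} m^2")
print(f"total area:     {total_m2:.0f} m^2, {ratio:.0f} times the flat area")
```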

LAW10 exists because of these examples. When I was learning Fermi estimation myself, I noticed something: I got measurably better after working through more of Weinstein’s examples. His reasoning patterns — the way he finds comparable objects, the way he uses “certainly more than X, certainly less than Y” — became internalized heuristics. I wanted the model to have the same experience.

Does it work? Partially. The model absorbs the decomposition patterns perfectly. Every test problem was broken down cleanly. But the bounding technique — the heart of Weinstein’s method — transfers inconsistently. Sometimes the model produces beautiful, physically-grounded bounds. Sometimes it just asserts a number.

Which brings us to the evidence.

What the framework actually changes: side by side

The quantitative lift is modest — 0.6 points on our 8-problem test. But the qualitative difference is real. Here’s what it looks like in practice.

Gecko adhesive force

Without framework (Project C — basic Fermi prompting):

Gecko mass: 50-100 g → use 75 g

That’s it. Two numbers, no reasoning. Why 50? Why 100? We have no idea.

With framework (Project A):

Gecko mass (BOUND):
Lower bound: more than 10 g (smaller than a mouse)
Upper bound: less than 500 g (smaller than a rat)
Geometric mean: √(10 × 500) ≈ 70 g

Now we can see the thinking. We know the bounds are anchored to familiar animals. We can ask: is a gecko really smaller than a rat? (Yes, most species.) Is it heavier than 10 grams? (Some species barely are.) The answer is similar — 70 g versus 75 g — but the reasoning is transparent and auditable.

Shark attack deaths

Without framework:

Annual shark attacks in US: ~30-50 → use 40
Fatal shark attacks: ~1-2 per year in US → use 1.5

Asserted. Where do these numbers come from? Training data, presumably. But we can’t verify the reasoning because there isn’t any.

With framework:

Annual shark deaths (BOUND):
Shark deaths are newsworthy when they occur, suggesting they’re rare:
Lower bound: more than 0.1 per year (would average 1 per decade)
Upper bound: less than 10 per year (would be weekly news)
Geometric mean: √(0.1 × 10) = 1 death per year in the US

Same order of magnitude. But Project A uses experiential reasoning — things you can verify from your own experience of watching the news. You don’t need to be a marine biologist to evaluate “if there were 10 shark deaths a year, that would be weekly news.” That’s the framework’s value: it forces the model to show reasoning you can engage with.

The drag coefficient — the signature example

This one appeared in Post 1, but it’s worth showing in full context.

Without framework:

Drag coefficient: C_d ≈ 1.0 (upright running human)
Bounds: [0.7, 1.3]

The model states a value, then wraps bounds around it that are barely wider than rounding error. The “bounds” are cosmetic — they don’t come from independent physical reasoning.

With framework:

Lower bound: more than 0.5 (very streamlined)
Upper bound: less than 2 (very blunt)
Geometric mean ≈ 1

Wider bounds, but anchored to the physical extremes of the drag coefficient spectrum. You can evaluate this: is a running human more streamlined than a flat plate? Yes. More blunt than a teardrop? Yes. The bounds make physical sense.
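The three bounded estimates above share one operation, so they can be checked together. The sketch below also reports the ratio of upper to lower bound as a rough measure of how much uncertainty each pair of bounds admits. That ratio is an illustrative addition here, not something the framework computes, and `bound_estimate` is a hypothetical name.

```python
import math

def bound_estimate(lower, upper):
    """Return (geometric mean, upper/lower ratio) for a pair of bounds."""
    if not 0 < lower < upper:
        raise ValueError("bounds must satisfy 0 < lower < upper")
    return math.sqrt(lower * upper), upper / lower

examples = {
    "gecko mass (g)": (10, 500),
    "US shark deaths per year": (0.1, 10),
    "drag coefficient, with framework": (0.5, 2),
    "drag coefficient, without framework": (0.7, 1.3),
}

for label, (lo, hi) in examples.items():
    estimate, spread = bound_estimate(lo, hi)
    print(f"{label}: estimate ~ {estimate:.2g}, upper/lower = {spread:.3g}")
```

The framework bounds span factors of 4 to 100 between lower and upper values, while the unframed drag-coefficient bounds span less than a factor of 2.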

The honest assessment

The gap is modest, not dramatic. Project A shows the best reasoning. Project C is weakest. But Project C still gets reasonable answers — modern Sonnet is good at Fermi estimation even without the framework.

The framework’s value isn’t making the model smarter. It’s making the model’s thinking visible. And once thinking is visible, you can critique it, improve it, and learn from it. That’s the difference between a calculator and a collaborator.

LAW-INTERPRET: the fix that worked

The Flash problem broke something. Both our framework-trained models (Projects A and B) received this problem:

“The Flash can run at (or near) the speed of light. What is the minimum force needed to reach a speed of 10% the speed of sound in 10 ms?”

Both interpreted “10% the speed of sound” literally: 34 m/s. Neither noticed the tension between “near the speed of light” in the setup and a target velocity of 34 m/s — roughly highway speed. The problem as worded by Weinstein is genuinely ambiguous, and the models barreled through without noticing.

This bothered me. The framework told them to ask when uncertain, but there was nothing specifically about checking whether the problem statement itself makes sense. So I added LAW-INTERPRET:

Before solving, state your interpretation of the problem. If the problem is ambiguous or could be read multiple ways, ASK for clarification before proceeding.

With specific triggers: vague terms, multiple reasonable interpretations, missing context, physically implausible scenarios.

To test it, I gave the model an absurd problem: “A Formula 1 car can reach speeds over 200 mph. What is the drag force on the car when traveling at 5% of walking speed?”

The response was immediate and correct:

I need to clarify this problem before proceeding. The problem mentions that “A Formula 1 car can reach speeds over 200 mph” but then asks about drag force at “5% of walking speed.”

This creates an ambiguity:

Interpretation 1 (literal reading): Calculate drag force at 5% of walking speed — 0.075 m/s (7.5 cm/s). This is incredibly slow — slower than a crawl.

Interpretation 2 (possible typo): The problem might have meant 5% of the speed of sound, 5% of 200 mph, or 50% of walking speed.

Could you clarify?

The model recognized the absurdity, stated the literal interpretation, offered alternatives, and asked before proceeding. After clarification (“use 50% of top speed”), it solved the corrected problem cleanly, including well-bounded drag coefficient estimates.

LAW-INTERPRET works reliably. It’s also the simplest of our additions — just a rule saying “check the problem statement before diving in.” Sometimes the most effective interventions are the most obvious ones.

LAW-FLAG: the fix that didn’t

The bigger problem — the one I flagged in Post 1 as “it doesn’t know that it doesn’t know” — proved much harder.

The model never spontaneously asks for help when it encounters a quantity it can’t justify. During the initial eleven-problem diagnostic, this happened zero times. The model would confidently assert specialized physics values as if they were common knowledge.

We tried to fix this with what I called LAW-FLAG: requiring every numerical estimate to be classified as KNOWN (common knowledge), BOUNDED (derived from explicit reasoning), or GUESSED (can’t justify → stop and ask). The idea was that the classification step would force self-assessment — the model would have to ask itself “do I really know this, or am I pulling it from training data?”
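LAW-FLAG itself is a prompt rule, not code, but the classification it asks for can be pictured as a small data structure. The sketch below is purely illustrative: the `Flag`, `Estimate`, and `questions_before_solving` names are hypothetical and do not appear in the framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Flag(Enum):
    KNOWN = "known"      # common knowledge
    BOUNDED = "bounded"  # derived from explicit reasoning
    GUESSED = "guessed"  # cannot justify: stop and ask

@dataclass
class Estimate:
    name: str
    value: Optional[float]
    flag: Flag
    justification: str = ""

def questions_before_solving(estimates):
    """Return one clarification request per estimate flagged GUESSED."""
    return [
        f"Can you provide a value for {e.name}?"
        for e in estimates
        if e.flag is Flag.GUESSED
    ]

estimates = [
    Estimate("speed of light (m/s)", 3e8, Flag.KNOWN),
    Estimate("gecko mass (g)", 70, Flag.BOUNDED, "between 10 g and 500 g"),
    Estimate("Yellowstone ejecta mass (kg)", None, Flag.GUESSED),
]

for question in questions_before_solving(estimates):
    print(question)
```

The mechanism is trivial once the flags are right. The hard part, as the tests below show, is getting the model to assign GUESSED in the first place.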

To show you what’s at stake, here’s what expert Fermi reasoning looks like on a hard problem — one that chains together quantities from multiple specialized domains.

Journey to the Center of the Galaxy

Weinstein asks: how much radiation damage would we receive traveling to the center of the galaxy? He first establishes the distance (~3 × 10⁴ light-years = 3 × 10²⁰ meters), the density of interstellar space (~1 hydrogen atom per cm³), and the number of atoms our spacecraft would encounter (3 × 10²² per cm² of cross-section). Then:^*

In order to travel this distance within a subjective human lifetime, our spacecraft must achieve a speed very close to the speed of light. Even traveling at the speed of light, if there is no relativistic time dilation, our trip will take 3 × 10⁴ years, which is slightly longer than our expected lifetime. Fortunately, at speeds close to the speed of light, subjective time passes more slowly than it does for an observer watching from the center of the galaxy. We need time to pass about 10³ times more slowly. In technical terms, this means we need a relativistic gamma factor of about γ = 10³.

This means that, from our point of view (i.e., in our reference frame) each interstellar atom will hit us traveling at almost the speed of light with the same gamma factor and will thus have a total energy of

E = γmc² = 10³ × (1 GeV) = 1 TeV.

At these energies, when the hydrogen atom strikes the spacecraft, it will lose its electron very quickly. The bare proton will then pass easily through the spacecraft and our bodies, depositing energy in our bodies at a rate of 2 MeV per centimeter.

Now we can consider the damage done by all these protons. Consider a volume of 1 cm³ within our bodies. At the density of water it has a mass of 1 g. It will be hit by 3 × 10²² protons, with each proton depositing 2 MeV of energy. This means that the total energy deposited per gram will be

E_dep = (3 × 10²² protons/cm²)(2 × 10⁶ eV/cm)(1 cm³/g)
= (6 × 10²⁸ eV/g)(1.6 × 10⁻¹⁹ J/eV)
= 10¹⁰ J/g = 10¹³ J/kg.

This seems like a lot, but let’s compare it with a few things before jumping to conclusions.

As it only takes 2 × 10⁶ J/kg to boil water, we would be vaporized more than 10⁶ times over. In a thirty-year journey lasting 10⁹ s, we would be turned to steam within the first 10³ s, or 20 minutes.

Unfortunately, we would not even survive that long. Radiation dose is measured in grays, where 1 Gy corresponds to an energy deposition of 1 J/kg. A lethal dose of radiation is about 10 Gy or about 10 J/kg. We will accumulate that dose in 10⁻¹² of our journey, or in the first millisecond.

Weinstein notes that these relativistic protons could pass through about a kilometer of metal — mass shielding can’t save you. His conclusion: who knew vacuum could be so dangerous?
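The dose chain can be replayed numerically. This sketch uses the excerpt's inputs and assumes, as the excerpt's fractions imply, that the dose accumulates uniformly over the trip. Variable names are illustrative.

```python
EV_TO_J = 1.6e-19
GRAMS_PER_KG = 1e3

protons_per_cm2 = 3e22     # atoms swept up per cm^2 of cross-section
deposit_ev_per_cm = 2e6    # energy lost per proton per cm of tissue
cm3_per_gram = 1.0         # tissue at the density of water

dose_j_per_kg = (protons_per_cm2 * deposit_ev_per_cm * cm3_per_gram
                 * EV_TO_J * GRAMS_PER_KG)

lethal_dose_j_per_kg = 10  # about 10 Gy
boil_j_per_kg = 2e6        # energy to boil water, as quoted
trip_seconds = 1e9         # thirty-year subjective journey

# Uniform accumulation: time to reach a dose is proportional to that dose.
time_to_lethal_s = trip_seconds * lethal_dose_j_per_kg / dose_j_per_kg
time_to_boil_s = trip_seconds * boil_j_per_kg / dose_j_per_kg

print(f"total dose:        {dose_j_per_kg:.1e} J/kg")
print(f"lethal dose after: {time_to_lethal_s:.1e} s")
print(f"boiling after:     {time_to_boil_s:.1e} s")
```

Carried through without rounding, the boiling time comes out at a few hundred seconds rather than 10³ s, because the excerpt rounds "more than 10⁶ times over" down to 10⁶. The conclusion is unchanged: the lethal dose arrives in about a millisecond.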

Every quantity in that chain is specialized knowledge. A non-physicist can’t justify “proton energy deposition is 2 MeV per centimeter” or “interstellar density is 1 atom per cm³.” But a human expert knows what they know and where it comes from. That self-awareness, knowing which numbers are common knowledge and which require domain expertise, is precisely what the model lacks. Now watch what happens when the model faces similar territory.

Test 1: Neutron star — FAIL

Problem: “How closely could we safely orbit a neutron star, considering only gravitational effects?”

The model’s classification:

Neutron star mass and radius:
KNOWN: Typical neutron star mass M ≈ 1.4 M☉ ≈ 3×10³⁰ kg
KNOWN: Typical neutron star radius R_ns ≈ 10 km = 10⁴ m

No hesitation. No uncertainty. “KNOWN.” But could you justify “neutron star mass is 1.4 solar masses” to someone who isn’t an astrophysicist? I couldn’t. The model could — because it had seen it thousands of times in training data. That’s not the same as knowing it.

What’s telling is the contrast: in the same solution, the model bounded the safety threshold beautifully, using fighter pilot g-forces as physical anchors. It can bound when it recognizes uncertainty. It just doesn’t recognize uncertainty about quantities that feel “textbook.”

Tightening the language

After this failure, I rewrote LAW-FLAG with stronger language:

KNOWN: Quantities you could justify to a skeptical non-specialist without domain expertise. If a skeptic could reasonably ask “how do you know that?” and you’d need specialized knowledge to answer, it is NOT KNOWN.

The “skeptical non-specialist” framing was the key addition — not “do you know it?” but “could you defend it to someone who doesn’t?”

Test 2: Yellowstone supervolcano — PASS

Problem: “If the Yellowstone supervolcano exploded and the mass of the ejecta fell back down in a ring around the caldera, how fast would the Earth be spinning afterward?”

This time:

However, I need clarification on two critical parameters:

1. Mass of ejecta: How much material are we talking about? Yellowstone eruptions vary enormously. Without specialized volcanology knowledge, I cannot justify a specific mass to a non-specialist.

2. Radius of the ring: How far from the caldera does this ejecta land?

Could you specify: Total mass (or volume) of ejecta to redistribute? Typical radius where the ejecta falls?

The model explicitly applied the “skeptical non-specialist” test and correctly concluded it couldn’t pass. It stopped and asked. Exactly what we wanted.

Test 3: Sun supernova — FAIL

Problem: “If the Sun went supernova, what would kill us first?”

Same tightened LAW-FLAG language. But the model plowed ahead:

Supernova total energy output: E_total ≈ 3×10⁴⁶ J — KNOWN from supernova theory
Neutrino interaction cross-section: σ_ν ≈ 10⁻⁴³ m² per nucleon — KNOWN from particle physics
Lethal radiation dose: ~10 Gy kills a human — KNOWN from radiation biology

“KNOWN from supernova theory.” “KNOWN from particle physics.” The model appended the name of a specialized field and treated that as justification. A non-specialist cannot defend “supernova total energy output is 3×10⁴⁶ joules” to a skeptic. The tightened language was supposed to prevent exactly this.

The pattern

Three tests, one success. Why did Yellowstone work but the astrophysics problems didn’t?

My hypothesis: the model distinguishes between “I’ve seen this number frequently in training data” and “this is clearly domain-specific.” Volcanic ejecta mass feels obscure — it’s not the kind of thing that appears in introductory textbooks. Neutron star mass (1.4 solar masses) and supernova energy (10⁴⁶ joules) feel canonical — they appear constantly in physics education materials. The model equates “I’ve encountered this many times” with “this is common knowledge.” But frequency in training data is not the same as common knowledge.

The “skeptical non-specialist” test should catch this — a non-specialist has never encountered 10⁴⁶ joules regardless of how often it appears in physics texts. But the model doesn’t genuinely simulate a non-specialist’s perspective. It simulates its own experience of the number, which is one of familiarity.

Compare this to how Weinstein handles similar territory in the galaxy problem. He uses quantities like “proton energy deposition: 2 MeV per centimeter” — but he’s a physicist, he knows it’s specialized, and he walks the reader through the reasoning. The model uses “neutrino cross-section: 10⁻⁴³ m²” with no more self-awareness than it uses “speed of light: 3×10⁸ m/s.” It treats both as equally obvious. That’s the metacognition gap.

Prompt engineering cannot fully close this gap. We pushed the language as far as we could, and it works sometimes, on some domains. That’s not enough.

The reflection agent hiding in plain sight

Here’s what I eventually realized.

Throughout this entire project — every test, every scoring session, every iteration of the framework — there was already a second agent catching these failures. When the model classified “neutron star mass” as KNOWN, someone pushed back: “Could you really justify that to a skeptic?” When bounds were cosmetic rather than physical, someone asked: “Where did that number actually come from?” When the model should have asked for help but didn’t, someone noticed.

That someone was me. And increasingly, it was Claude helping me do it — analyzing outputs, identifying patterns, spotting unjustified assertions across multiple solutions.

I was the reflection agent before I knew the term.

This isn’t just a cute observation. It’s a design insight. The estimation task and the reflection task require different things. Estimation requires confidence — proceeding with imperfect information, committing to bounds, generating answers. Reflection requires skepticism — questioning classifications, challenging justifications, catching overconfidence. Asking one model to do both simultaneously is asking it to be confident and skeptical about its own work at the same time. No wonder LAW-FLAG only works intermittently.

What if we separated the tasks?

What’s next

In Part 3, I’ll describe what happened when I tried to build that separation: one agent that estimates, another that reflects. The estimator-reflector architecture — automating the role I’d been playing all along.

The framework you’ve seen in this post gives the estimator its methodology. The three LAW-FLAG tests you’ve seen give the reflector its target: catch the assertions the estimator can’t catch itself. The question is whether a second model can do what prompt engineering couldn’t — reliably distinguish between “I know this” and “I’ve merely seen this.” From what I’ve seen so far — both doing it myself and watching a different model catch the same failures independently — I’m optimistic. The open question isn’t whether reflection works, but how much of it we can automate.

Stay tuned.

The Framework

The complete Laws of Fermi Problem Solving (v4) are available as a GitHub gist. It includes the 17 laws, reference tables, and four worked examples from Weinstein’s books (reproduced with permission).

AI/HI (Human Intelligence) Transparency Statement

Modified from Brewin

Has any text been generated using HI?	Yes
Has any text been generated using AI?	Yes
Has any text been improved or corrected using HI?	Yes
Have any methods of analysis been suggested using HI?	Yes
Have any methods of analysis been suggested using AI?	Yes
Do any analyses utilize AI technologies, such as Large Language Models, for tasks like analyzing, summarizing, or retrieving information from data?	Yes

Additional context:

The Fermi estimation framework described in this post — the Laws, their design rationale, and the iterative refinement over three years — is entirely my work, developed through hands-on problem-solving with Weinstein’s books and iterative testing with GPT-3.0 and later Claude. The testing methodology — designing the controlled comparison (Projects A, B, C), crafting and iteratively refining the project prompts, selecting test problems, and analyzing results across conditions — was developed collaboratively with Claude. Scoring and editorial decisions are mine.

This post was drafted collaboratively with Claude (Anthropic). I provided the source materials (framework documents, test transcripts, scoring records, session notes), structure, and editorial judgment throughout. Claude drafted prose from these inputs, verified Weinstein excerpts against source PDFs for copyright compliance, and helped with HTML formatting. I reviewed, revised, and made all final decisions about what went to publication.

The Weinstein excerpts are verbatim from the published books, used under license from Princeton University Press via the Copyright Clearance Center.

Notes

^† Used with permission of Princeton University Press, from Guesstimation: Solving the World’s Problems on the Back of a Cocktail Napkin, Weinstein, Lawrence; Adam, John A., © 2008; permission conveyed through Copyright Clearance Center, Inc.

References

Weinstein, L. (2012). Guesstimation 2.0: Solving Today’s Problems on the Back of a Napkin. Princeton University Press.

Weinstein, L., & Adam, J. A. (2008). Guesstimation: Solving the World’s Problems on the Back of a Cocktail Napkin. Princeton University Press.

Epstein, E. L., Winnicki, J., Sornwanee, T., & Dwaraknath, R. (2025). LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval. arXiv preprint arXiv:2510.26995.

Kalyan, A., Kumar, A., Chandrasekaran, A., Sabharwal, A., & Clark, P. (2021). How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. arXiv preprint arXiv:2110.14207.

