How many piano tuners are in Chicago?
That’s one of the famous questions Enrico Fermi posed to his physics students. Not because he cared about piano tuners, but because he wanted to teach them how to reason through problems with limited information. You don’t need to know the exact answer – you need to know how to think about it.
Let’s break it down: Chicago has maybe 3 million people. Perhaps 1 in 20 people owns a piano – certainly not as common as TVs (nearly everyone), but more common than harps (quite rare). That’s roughly 150,000 pianos, each needing tuning about once a year. A tuner can probably do 4 pianos a day – more than 1 (they’d barely make a living), fewer than 10 (each tuning takes a couple of hours) – which, over roughly 250 working days, is about 1,000 tunings per year. That means you need about 150 tuners.
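If you like seeing the chain of assumptions laid out explicitly, here is the same estimate as a few lines of Python – the inputs are just the rough guesses above, nothing more:

```python
# Back-of-the-envelope piano tuner estimate for Chicago.
# All inputs are rough, order-of-magnitude guesses, not data.
population = 3_000_000          # people in Chicago, roughly
pianos = population / 20        # guess: ~1 piano per 20 people -> ~150,000
tunings_per_year = pianos * 1   # each piano tuned about once a year

tunings_per_tuner = 4 * 250     # ~4 pianos/day over ~250 working days -> ~1,000
tuners = tunings_per_year / tunings_per_tuner
print(round(tuners))            # ~150
```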
NASA even uses Fermi problems like this to teach estimation. And notice: you just got a reasonable answer using nothing but logical decomposition and bounded guesses.
This is Fermi estimation. And I tried to teach an AI to do it.
Why this matters
I’ll be honest: this project started with a bigger ambition than solving physics puzzles. I wanted to know if AI could become a partner in thinking critically about numbers – not just as a calculator, but as a collaborator in quantitative reasoning.

One inspiration was Carl Bergstrom and Jevin West’s book Calling Bullshit. They describe what they call the “new bullshit”: using data, statistics, and visualizations to construct the appearance of scientific rigor – giving false authority to misleading claims. It’s sophisticated deception dressed up as analysis. The book teaches a skill: detecting and dismantling this kind of numerical misdirection. Fermi estimation is a different but complementary skill – it exercises similar muscles: decomposing claims, questioning assumptions, sanity-checking results.

The dream was an AI partner for this kind of thinking. Not an oracle that gives you answers, but a collaborator that helps you interrogate whether answers are reasonable. It was Bergstrom and West’s YouTube lectures that first introduced me to Lawrence Weinstein’s Guesstimation books – which became the methodological backbone of this project.
The frustration
Let me take you back to 2023. I was excited – maybe naively so – about using LLMs for Fermi estimation. These models had access to vast amounts of physics knowledge, and they were supposedly good at reasoning.
But here’s the thing: they were terrible at arithmetic. Not just GPT-3 – the problem persisted through GPT-3.5 and even GPT-4. I documented one session where GPT-4 attempted to multiply 836 x 1923 eight times. It got the correct answer exactly once – then regressed to different wrong answers on subsequent attempts. It could describe the long multiplication algorithm perfectly while executing it incorrectly. Each attempt produced a different error. There was no stable learning, just occasional lucky generations.
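For the record, the correct product is trivial to check outside the model:

```python
print(836 * 1923)  # 1,607,628 -- the answer GPT-4 produced only once in eight attempts
```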
The arithmetic failures made it impossible to even evaluate whether the models could do Fermi estimation. Was the model struggling with decomposition? With bounding? With proceeding under uncertainty? I couldn’t tell – every attempt would derail on basic multiplication or unit conversion long before I could assess the reasoning quality.
The model would freeze when faced with uncertainty. It would apologize profusely for not having exact values. When it did proceed, it would present numbers with no reasoning, no bounds, no indication that this was a guess rather than a fact. Where did that number come from? Is it reasonable? You have no way to know. But I couldn’t systematically address any of this while arithmetic kept breaking.
I started drafting what would become the framework – ideas about decomposition, bounding, proceeding with uncertainty. But I couldn’t finalize anything. The arithmetic problem had to be solved first.
Note: Modern models with carefully designed custom instructions handle arithmetic more reliably. I’ve verified this myself: I tested ChatGPT 4.5, adapting my instructions from Jeremy Howard’s recommendations in A Hackers’ Guide to Language Models, and got the same results he does on the tricky questions he works through. But the judgment problem remains unsolved.
The Wolfram plugin: the turning point
After documenting these failures, I added the Wolfram plugin to ChatGPT (Wolfram, 2023). That’s when things changed enough for me to really push on the Fermi idea.
Wolfram fixed the arithmetic: unit consistency improved, scientific notation became reliable, basic math errors disappeared. The model could finally compute accurately and consistently.
And that’s when I could finally see what remained.
The model could compute, but it couldn’t estimate. It still wouldn’t commit to bounded guesses. It still hedged or refused when faced with uncertainty. With arithmetic no longer obscuring the problem, the real bottleneck became obvious:
The model could calculate, but it couldn’t reason like Fermi.
This led to a crucial realization. There were two different problems requiring two different solutions:
1. Arithmetic reliability -> Wolfram plugin solved this
2. Estimation judgment -> Required something else entirely
The bottleneck wasn’t calculation. It was the model’s inability to proceed confidently with explicit uncertainty – to say “I don’t know this exactly, but I can bound it” and then move forward.
Only then could I finalize the framework.
Drafting the Laws: permission to estimate
With computation outsourced to Wolfram, I could finally focus on the estimation problem itself. I drafted what I called “the Laws” – giving the model permission and methodology to reason under uncertainty.
The framework I drafted wasn’t pulled from thin air. The methodology itself comes from Enrico Fermi – this is how he trained his physics students to reason and experiment in their heads. Lawrence Weinstein took that foundational approach and built something substantial: a rich, modern pedagogical method with brilliant and often humorous worked examples in his Guesstimation books (Weinstein & Adam, 2008; Weinstein, 2012). My Laws were an attempt to translate Weinstein’s approach into something an AI could follow – to adapt it from human pedagogy to human-AI partnership.
The Laws I drafted – LAW0 through LAW11 – addressed both the arithmetic mechanics and the estimation judgment:
LAW0: A mandatory 4-step process with explicit check-ins: “you must ask me if you have missed any facts,” “you must ask me if I can provide some constraint.” The collaborative DNA was there from the start – I knew the model needed human input. It just wouldn’t ask for it, even when told.
LAW1-LAW9: Scientific notation rules, arithmetic procedures, and unit handling – establishing a procedure the model could reliably follow, making the reasoning transparent and auditable. With Wolfram handling the computation, these laws ensured consistency and transparency in how calculations were performed and reported. LAW8-LAW9 specifically addressed unit conversions and ensuring final answers were always in SI units.
LAW10: “Read all examples before solving.” I was observing my own learning process – how I improved as I worked through more of Weinstein’s examples, how I absorbed his thinking patterns, how my decomposition and bounding strategies refined with practice. I replicated that learning process for the AI by providing worked examples.
LAW11: “Don’t reprint my prompt” – fighting the model’s tendency to waste tokens repeating instructions back to me.
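To give a flavor of what this looked like in practice, here is a condensed, paraphrased sketch of a few of the 2023 Laws written as custom instructions – illustrative only, not the verbatim text, and the wording of the LAW1–LAW9 excerpt is my own reconstruction from the description above:

```
LAW0: Follow the mandatory 4-step process. Before computing, you must ask
      me if you have missed any facts, and you must ask me if I can
      provide some constraint.
LAW1-LAW9 (excerpt): Use scientific notation throughout; carry units
      through every step; convert the final answer to SI units.
LAW10: Read all worked examples before solving.
LAW11: Don't reprint my prompt.
```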
The framework combined reliable computation (Wolfram) with structured methodology (the Laws). It helped – but the results were inconsistent. The model would follow some Laws, ignore others, and systematic improvement remained elusive.
The real breakthrough would have to wait for better models and a rebuilt framework.
Building something better
Fast forward to late 2025. Better models, more experience, and lessons learned from that 2023 framework. I rebuilt it from scratch – not to compensate for failure, but to enable even greater success.
The new framework has 17 Laws: 12 mechanical (arithmetic, units, notation) and 5 focused specifically on estimation judgment. It includes 14 worked examples demonstrating complete reasoning patterns. The core shift was from defensive rules to empowering principles (“bound uncertain quantities with explicit reasoning,” “classify every estimate as KNOWN, BOUNDED, or ASK”).
The 2023 framework worked – it just wasn’t refined enough. The late 2025 version is more systematic, more explicit about judgment, and designed from the ground up for modern models that can reliably compute but still struggle with metacognition.
I tested it with Claude Sonnet 4.5. Here’s a before/after on one of the test quantities – the drag coefficient of an F1 car:
Before framework:
Model writes: “The drag coefficient for an F1 car is 0.9”
Where did 0.9 come from? Is that reasonable? We have no way to know.
After framework:
Model writes: “Lower bound: more than 0.5 (very streamlined). Upper bound: less than 2 (very blunt). Geometric mean ~ 1.”
Now we see the thinking. We know what the bounds are. Even if you don’t know fluid dynamics, you can think: “is an F1 car closer to streamlined or blunt?” You can disagree intelligently.
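The geometric mean is just the square root of the product of the two bounds – it splits the difference on a logarithmic scale, which is the scale that matters when all you trust is the order of magnitude:

```python
import math

lower, upper = 0.5, 2.0              # bounds on the F1 drag coefficient
estimate = math.sqrt(lower * upper)  # geometric mean of the bounds
print(estimate)                      # 1.0
```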
The framework doesn’t make the model know more. It makes the model show its uncertainty instead of hiding it behind assertions that sound confident but can’t be audited.
The results: does the framework actually help?
To find out, we ran a controlled comparison. Same eight problems, same model (Sonnet 4.5), same system-wide preferences. The only difference: one condition got the full framework (17 Laws, 14 worked examples, explicit KNOWN/BOUNDED/ASK classification requirements), while the other got only basic Fermi methodology – “decompose problems, bound uncertain quantities with upper and lower limits, take geometric means, show your work.”
The results:
- Minimal prompting: 7.9/10
- Full framework: 8.5/10
That’s a modest lift – 0.6 points – and with only eight problems and a single evaluator, it’s a directional finding, not a statistically significant one. The model can do reasonable Fermi estimation with just basic methodology concepts. Modern LLMs have improved dramatically since 2023.
But the numbers don’t capture everything. Reading the solutions side by side, the framework-guided answers present their reasoning differently. Instead of “drag coefficient ~ 1,” you get explicit bounds with justification. The answers are more auditable, more teachable, more useful for collaboration.
We also tested whether providing reference tables (order-of-magnitude values for common quantities) would help. Score: identical. But the model cited its sources explicitly – “air density: 1.2 kg/m^3 (from reference tables)” – instead of just labeling values as “KNOWN.” Same accuracy, better transparency.
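For context, the reference tables were simply order-of-magnitude anchors. A few illustrative entries – my own reconstruction, not the actual tables used in the test – look like this:

```
air density            ~1.2 kg/m^3
water density          ~1,000 kg/m^3
seconds in a year      ~3 x 10^7
Earth's radius         ~6.4 x 10^6 m
world population       ~8 x 10^9
```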
I also ran an earlier exploratory test on eleven different problems from Weinstein’s books. That diagnostic helped shape the framework, revealing where the model needed more guidance. The patterns were consistent: decomposition excellent, bounding inconsistent, asking for help essentially never.
The framework’s value isn’t dramatic capability lift. It’s structure that enables auditing, teaching, and targeted improvement.
What works excellently:
- Decomposition: Perfect. Every single problem was broken down cleanly.
- Forward progress: The model proceeds confidently with bounded estimates.
- Numerical accuracy: Usually within an order of magnitude of Weinstein’s solutions.
- Comparisons: Good contextualization of answers.
What works when tested explicitly:
- Interpretation of ambiguous problems: We added a law (LAW-INTERPRET) that tells the model to pause and ask when a problem statement seems weird. We tested it with an absurd prompt – an F1 car at “5% of walking speed.” The model correctly recognized the absurdity, stated its interpretation, and asked for clarification. This works.
What’s inconsistent:
- Bounding quality: Sometimes excellent, sometimes the model retrieves numbers from training without explicit justification. The model doesn’t know when it’s doing bounding well versus when it’s pulling data without reasoning. The numbers are often correct, but the reasoning isn’t transparent – you can’t audit the thinking or learn from the process.
What almost never triggers:
- Asking for help on uncertain quantities: Zero times out of eleven problems initially.
As I noted in my evaluation:
“Sonnet is doing extremely well at understanding the problems, and the decomposition is always spot on. The answers are very reasonable, often close to Weinstein’s. However, its bounding is weak. Sometimes it gets it excellently, but more often than not, it gets maybe one bound but completely misses another. And it does not know it does, so it just pulls data out of nowhere, rather than asking.”
That last part is crucial: it doesn’t know that it doesn’t know.
The attempted fix
We tried to solve this. We added a law (LAW-FLAG) requiring the model to classify every estimate:
- KNOWN: Facts any reasonable person would accept (speed of light, human height)
- BOUNDED: Derived from explicit upper/lower reasoning
- GUESSED: Can’t justify -> STOP and ask
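To make the intent concrete, here is a hypothetical fragment of the kind of labeling LAW-FLAG asks for – my own illustration, not an actual model transcript:

```
human walking speed:   KNOWN   ~1.5 m/s (easy to justify to any skeptic)
F1 drag coefficient:   BOUNDED between 0.5 (streamlined) and 2 (blunt);
                       geometric mean ~1
mass of erupted magma: GUESSED -> STOP: I can't justify a bound.
                       Can you provide a constraint?
```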
Our first version failed. On a neutron star problem, the model classified “neutron star mass ~ 1.4 solar masses” as KNOWN. Technically it’s in the training data, but could you justify that to a skeptic who isn’t an astrophysicist?
We tightened the language: “KNOWN applies only to quantities you could justify to a skeptical non-specialist.”
This worked… once. On a Yellowstone supervolcano problem, the model stopped and asked for the mass of magma. Progress! But on an astrophysics problem (what would kill us first if the Sun went supernova), it plowed ahead with “KNOWN from supernova theory” and “KNOWN from particle physics.”
Three tests, one success. The fix isn’t reliable.
The evolution summarized
| Capability | GPT-3.5/4 (2023) | Sonnet 4.5 + Framework (late 2025) |
|---|---|---|
| Decomposition | Inconsistent | Excellent |
| Arithmetic | Poor (plugin helped) | Solid |
| Bounding | Weak | Sometimes excellent, can’t self-assess |
| Proceeds confidently | Often refused | Yes |
| Asks when problem is ambiguous | Never | Yes (LAW-INTERPRET works) |
| Asks when stuck on values | Never | Rarely (1/3 tests) |
| Numerical accuracy | Poor | Good |
| Comparisons | Rare | Good |
The gap has narrowed significantly. What remains is the hardest part: metacognition about its own uncertainty. And I’ll be honest – bounding is at least as important as decomposition in Fermi’s approach. Without reliable bounding and self-awareness, the model gets good answers but can’t teach you how it got them or give you confidence about why they’re reasonable.
Where this fits in the research
More recently, I learned that researchers at Allen AI had found that fine-tuned LLMs were off by two orders of magnitude on average when solving Fermi problems (Kalyan et al., 2021). On those axes, our results look considerably better: decomposition is now essentially solved (perfect scores), and numerical accuracy is much improved (most answers within 10x).
But the deeper problem remains unsolved. The model doesn’t know what it doesn’t know. It produces bounds when it has familiar anchors (everyday quantities, mathematical relationships), but pulls numbers from nowhere when facing exotic physics – and doesn’t recognize the difference. It almost never asks for help, even when it should. Recent work on LLM calibration confirms this: models are systematically overconfident, with nominal 99% confidence intervals covering the true value only 65% of the time (Epstein et al., 2025).
We’ve made progress on the methodology of Fermi estimation. The metacognition gap – knowing when you’re guessing – remains open.
Is the framework worth it?
That depends on what you want. If you just need a rough estimate, modern Sonnet can produce decent answers with minimal prompting (7.9/10). But if you want Fermi-style reasoning – methodical, transparent, bounded estimation – then yes, the structure matters.
Yet the framework alone isn’t the full solution. Throughout this project, reflection was the critical ingredient: challenging unjustified assertions, asking “could you defend that bound to a skeptic?”, and catching when the model should have asked for help but didn’t.
The framework makes reasoning visible. Reflection makes it rigorous.
And that reflection layer – currently human, potentially multi-agent in the future – is what Part 3 will explore.
What’s coming next
In Part 2, I’ll share the framework itself – all 17 Laws, the structure, and several worked examples. I’ll show you what works reliably and what we tried to fix.
But I’ll also share what I eventually realized. Throughout this whole project, there was already a “reflection agent” catching assertions without reasoning and challenging unjustified claims. Asking “could you really justify that to a skeptic?” Flagging when the model should have asked for help.
It was me. The human in the loop, with enlisted help from Claude.
In Part 3, we’ll explore what that means. If the model can’t reliably judge its own uncertainty, maybe the solution isn’t better prompting – it’s separating the tasks. One agent estimates. Another agent reflects. That’s the architecture I was already using before I realized it was a “thing”.

A confession
While drafting this post, I asked Sonnet (a different instance) to help write an opening example – a bullet fired from Los Angeles: could it reach you in San Diego? Sonnet wrote “the bullet stops in about 2 kilometers” without showing any derivation.
When I pushed back, it admitted: “I just did what we’re critiquing – presented a number without showing the work.”
Then we tried to verify it together. Constant deceleration gives roughly 600 meters – I verified this with ballistic calculators. Where did that 2 km come from? We still don’t know.
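For what it’s worth, here is one way to land near that figure – a sketch under my own assumed values (a rifle-class bullet at ~900 m/s, ~10 g, drag coefficient ~0.3), which may or may not match the assumptions behind the 600-meter number:

```python
import math

# Constant-deceleration sketch: take the air-drag deceleration at muzzle
# velocity and ask how far the bullet travels before it stops.
# All inputs below are assumed round numbers, not measured data.
rho = 1.2                      # air density, kg/m^3
v0 = 900.0                     # muzzle velocity, m/s (assumed)
mass = 0.010                   # bullet mass, kg (assumed)
diameter = 0.0078              # bullet diameter, m (assumed)
cd = 0.3                       # drag coefficient (assumed)

area = math.pi * (diameter / 2) ** 2
drag = 0.5 * rho * cd * area * v0 ** 2   # drag force at muzzle velocity, N
decel = drag / mass                      # ~700 m/s^2
distance = v0 ** 2 / (2 * decel)         # stopping distance, m
print(round(distance))                   # ~580 m -- the right ballpark for ~600 m
```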
That’s why I replaced the bullet problem with the piano tuner question – something we could verify. And that’s why this work matters.
Stay tuned.
References
Bergstrom, C. T., & West, J. D. (2020). Calling Bullshit: The Art of Skepticism in a Data-Driven World. Random House.
Epstein, E. L., Winnicki, J., Sornwanee, T., & Dwaraknath, R. (2025). LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval. arXiv preprint arXiv:2510.26995.
Kalyan, A., Kumar, A., Chandrasekaran, A., Sabharwal, A., & Clark, P. (2021). How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. arXiv preprint arXiv:2110.14207.
Weinstein, L. (2012). Guesstimation 2.0: Solving Today’s Problems on the Back of a Napkin. Princeton University Press.
Weinstein, L., & Adam, J. A. (2008). Guesstimation: Solving the World’s Problems on the Back of a Cocktail Napkin. Princeton University Press.
Wolfram, S. (2023). ChatGPT Gets Its “Wolfram Superpowers”! Stephen Wolfram Writings. Retrieved from https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/