Previously, on this blog
In Part 1 of this series, I told you the story: how I tried to teach AI to do Fermi estimation, hit a wall with GPT-3’s arithmetic, got past it with the Wolfram plugin, and discovered the real problem wasn’t calculation — it was judgment. I showed you the results: a framework that lifts scores modestly (7.9 to 8.5 out of 10) but transforms the reasoning from opaque assertions into auditable, bounded estimates.
I promised to show you the framework itself. Here it is.
The architecture
The framework has two layers. The first handles mechanics — how to do the math. The second handles judgment — how to think about what you know and don’t know. The mechanical layer matters less than you’d expect. The judgment layer matters more.
The mechanical laws (LAW0–LAW11)
These twelve laws govern arithmetic, notation, and reporting. They exist because early LLMs were unreliable at basic computation, and even modern models benefit from explicit procedural scaffolding.
LAW0 sets a five-step workflow adapted from Chapter 2 of Weinstein’s Guesstimation (“Dealing with Large Numbers”): interpret the problem, identify known facts, make assumptions, calculate explicitly, and report results. This is the backbone — every problem follows this sequence.
LAW1–LAW7 handle scientific notation: how to multiply (add exponents), divide (subtract exponents), take square roots (adjust odd exponents), and compute geometric means. These feel pedantic, but they ensure consistent notation and prevent cascading errors in multi-step calculations. They were drafted partly in response to GPT-4’s arithmetic failures — in one documented session, it multiplied 836 × 1923 eight times and got a different wrong answer on seven of them. As I noted in Part 1, modern models handle arithmetic reliably; the judgment problem is what persists.
LAW8–LAW9 handle the practical side of units: do the math in whatever units are convenient (inches, miles, eV), but convert to SI for the final answer; and keep exponents positive by choosing appropriate unit scales. LAW10 says “read the worked examples first” — learning from demonstrations before attempting problems. LAW11 is prosaic but necessary: “don’t waste tokens rephrasing my question back to me.”
If you’re an experienced estimator, you can skim these. They’re scaffolding for the model, not insight for you. The estimation laws are where it gets interesting.
The estimation laws: teaching judgment
These seven laws address the hard part — knowing when to guess, how to bound, and when to stop and ask for help.
LAW-INTERPRET says: before doing anything, state what you think the problem is asking. If it’s ambiguous, ask. This was added after a specific failure I’ll show you shortly.
LAW-ESTIMATE is the core: for every quantity you use, you must do one of three things. State it as known — a fact any technically literate person would accept (speed of light, human height, Earth’s population). Bound it with reasoning — construct explicit upper and lower limits from physical or logical arguments, then take the geometric mean. Or ask — if you can’t do either, stop and request human input. “I’ll estimate X as Y” without justification is explicitly forbidden.
LAW-BOUNDS governs the quality of those bounds. They should span 1–3 orders of magnitude. They should be physically motivated, not arbitrary. And — this is one of my favorite additions — behavioral observations make excellent bounds. “A gecko can walk on a ceiling but cannot run on one, so adhesive force is between 1× and 10× body weight.”
LAW-PROCEED gives permission to be imprecise: a factor-of-ten estimate beats no estimate. This is straight from Weinstein: “Dare to be imprecise!” It counteracts the model’s tendency to hedge or refuse when it doesn’t have exact data.
LAW-DECOMPOSE is the problem-solving engine: if stuck, ask “what would I need to know to answer this?” Each answer becomes a sub-problem. Then classify each sub-quantity as KNOWN, BOUND, or ASK before proceeding.
LAW-VALIDATE and LAW-REPORT handle the endgame: cross-check using independent approaches, express answers in SI units with scientific notation, and always compare to something meaningful.
The whole thing flows as a six-step cycle: INTERPRET → DECOMPOSE → ESTIMATE → CALCULATE → VALIDATE → REPORT.
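One way to picture the bookkeeping this cycle demands: every quantity entering a calculation carries a status tag, and only KNOWN and BOUND quantities may proceed. The class and field names below are my own sketch, not anything in the framework itself:

```python
from dataclasses import dataclass
from enum import Enum, auto
import math

class Status(Enum):
    KNOWN = auto()   # a fact any technically literate person would accept
    BOUND = auto()   # derived from explicit lower/upper limits
    ASK = auto()     # cannot justify -- stop and request human input

@dataclass
class Quantity:
    name: str
    status: Status
    value: float = 0.0   # used when KNOWN
    lower: float = 0.0   # used when BOUND
    upper: float = 0.0

    def estimate(self) -> float:
        if self.status is Status.KNOWN:
            return self.value
        if self.status is Status.BOUND:
            return math.sqrt(self.lower * self.upper)  # geometric mean
        raise ValueError(f"{self.name}: unjustified -- ASK a human first")

gecko_mass = Quantity("gecko mass (g)", Status.BOUND, lower=10, upper=500)
print(round(gecko_mass.estimate()))  # -> 71
```

The point of the tag is the failure mode: an ASK quantity raises instead of silently producing a number, which is exactly the behavior the framework tries to coax out of the model.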
What good Fermi reasoning looks like
Before showing you how the model performs, let me show you what we’re aiming for. Here’s one of Weinstein’s worked examples — a problem that demonstrates nearly every technique the framework encodes.
The setup: Weinstein is at a Norfolk Tides baseball game when a foul ball showers his friends with beer. The question — what is the probability of a foul ball landing in a cup of beer during one game, and how many “splash downs” happen across a full MLB season? He starts by estimating about forty foul balls per game land in the stands (more than one per inning, fewer than twenty — geometric mean of five, times nine innings). Then he decomposes the per-ball probability:*
Now we need to estimate the probability that a given foul ball will land directly in a cup of beer. (Note: only beer is sold in open-topped cups.) This means that we need to break the problem into even smaller pieces. Let’s assume that the cup of beer is sitting innocently in a cup holder. To hit a cup of beer, the foul ball needs to:
1. not be caught by a fan
2. land within the area of a seat
3. hit a seat whose owner has a cup of beer
4. land in the cup
Most fly balls are caught, but many are not. Let’s estimate that between one-quarter and one-half of fly balls are not caught. “Averaging” the two, we will use one-third.
Most of the stadium area is used for seating, so let’s ignore that factor.
At any given time, more than 1% and less than 100% of fans have a cup of beer in front of them. Using the geometric mean, we estimate that 10% of seats have beer cups.
A large beer cup is 4 inches (10 cm) across, so the baseball must land in an area defined by
Acup = πr² = 3(2 in)² = 10 in²
The area of the seat (from arm rest to arm rest and from row to row) is about 2 ft by 3 ft (60 cm by 90 cm), so
Aseat = (24 in) × (36 in) = 10³ in²
Thus, if the ball hits a seat that has a cup of beer, the probability that it lands in the cup is
Pcup = Acup/Aseat = 10 in²/10³ in² = 10⁻²
or 1%. The metric probability is the same.
(Extra credit question: Which is more likely, that the ball lands in the cup in the cup holder, splashing the beer, or that the fan is holding the cup of beer when the foul ball arrives and splashes it in his or her excitement?)
This means that the probability that any single foul ball lands in a cup of beer is
P = (1/3) × (1/10) × (10⁻²) = 3 × 10⁻⁴
With forty foul balls per game, this means that the probability of a foul landing in a cup of beer during any one game is 10⁻². This is not very likely. The probability that we will be directly below the splash is even less likely.
He then scales to the full MLB season — 30 teams, 160 games each, about 2,000 games total — and gets about 20 beer landings per season. His closing observation: because baseball analysts keep meticulous statistics, he’s very surprised they don’t appear to record beer landings.
Notice what happened. An impossible-sounding question became four independent sub-conditions, each bounded with reasoning anyone can follow. The geometric mean of 1% and 100% gives 10% — you don’t need to survey stadiums. The area ratio gives 1% — you don’t need probability theory. And the answer emerges from combining many small, defensible estimates. No single number needs to be precise. The decomposition does the work.
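If you want to replay the chain yourself, here's a quick Python sanity check using Weinstein's rounded inputs (40 foul balls per game, roughly 2,000 games per season):

```python
import math

def gm(lo, hi):  # geometric mean of two bounds
    return math.sqrt(lo * hi)

foul_balls_per_game = 40
p_not_caught = 1 / 3             # between 1/4 and 1/2 are not caught
p_beer_at_seat = gm(0.01, 1.0)   # 1%..100% of fans -> ~10%
p_lands_in_cup = 10 / 1000       # cup area / seat area = 1%

p_per_ball = p_not_caught * p_beer_at_seat * p_lands_in_cup
p_per_game = foul_balls_per_game * p_per_ball
per_season = p_per_game * 2000

print(f"per ball:   {p_per_ball:.1e}")   # ~3e-4
print(f"per game:   {p_per_game:.2f}")   # ~0.01
print(f"per season: {per_season:.0f}")   # ~27; Weinstein rounds to ~20
```

The slight mismatch with Weinstein's ~20 is just order-of-magnitude rounding along the way, which is the whole point: no single number needs to be precise.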
This is what we’re teaching the model to do.
The worked examples: teaching by demonstration
The framework includes 14 worked examples from Weinstein’s Guesstimation books. Each demonstrates specific reasoning patterns the laws encode. Here are two that show how different techniques work in practice.
Bounding by physical comparison
In Spider-Man 2, Spider-Man stops a runaway subway train by attaching his webs to nearby buildings and pulling for ten or twenty city blocks. Weinstein asks: how much force does he need? In the book, this follows a crash test dummy problem using the same work-energy approach — Weinstein builds technique through repetition with variation. Here’s how he tackles it:†
In the movie Spider-Man 2, Spider-Man stops a runaway New York City six-car subway train by attaching his webs to nearby buildings and pulling really hard for 10 or 20 city blocks. How much force does he have to exert to stop the subway train? Give your answer in newtons and in tons (1 ton = 10⁴ N). How does this compare to the force that you can exert?
Since the work done by Spider-Man to stop the train is equal to the train’s initial kinetic energy, we need to estimate the mass and velocity of the train. We will then need to estimate the stopping distance in order to calculate the force exerted.
A subway car is about the same size and weight as a semi-trailer (18-wheeler) truck. This is between 10 and 40 tons. We’ll use 20 tons (or 2 × 10⁴ kg). There are six cars on a train so that the mass of the train is 6 × 2 × 10⁴ kg = 10⁵ kg.
They certainly go faster than 20 mph and slower than 100 mph. Since it is not that far between subway stops, subways travel at only about 40 mph (20 m/s).
Thus, the kinetic energy of a subway train is:
KE = ½mv² = 0.5 × 10⁵ kg × (20 m/s)² = 2 × 10⁷ J
There are 20 blocks per mile in Manhattan. Thus, 10 or 20 blocks is about 1 km or 10³ m. (It’s certainly more than 100 m and less than 10 km.)
Thus, Spider-Man needs to exert a force:
F = KE/d = (2 × 10⁷ J)/(10³ m) = 2 × 10⁴ N
F = (2 × 10⁴ N)/(10⁴ N/ton) = 2 tons
A force of 2 × 10⁴ N is the weight of 2000 kg or 2 tons. For a superhero who can lift cars, this is quite possible (although definitely not easy). A human could definitely not do it.
Wow! Hollywood got the physics correct, in a superhero movie no less! Hurray!
The key move: “a subway car is about the same size and weight as a semi-trailer truck.” He doesn’t look up the mass of a subway car. He bounds it by comparison to something familiar — between 10 and 40 tons — and uses 20 as a round number near the geometric mean. Every quantity gets the same treatment: velocity bounded by “certainly faster than 20 mph, slower than 100 mph,” stopping distance from “20 blocks per mile in Manhattan.” The entire solution is built from things you can verify from ordinary experience.
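The work-energy arithmetic checks out in a few lines of Python (numbers taken straight from the excerpt):

```python
mass = 1e5   # kg: six ~20-ton subway cars, rounded
v = 20       # m/s: ~40 mph
d = 1e3      # m: 10-20 Manhattan blocks, ~1 km

kinetic_energy = 0.5 * mass * v**2   # work needed to stop the train
force = kinetic_energy / d
print(f"KE = {kinetic_energy:.0e} J")                  # 2e+07 J
print(f"F  = {force:.0e} N = {force / 1e4:.0f} tons")  # 2e+04 N = 2 tons
```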
The obvious answer is wrong
This next example is my personal favorite. It was the first Weinstein problem I solved entirely on my own — and the moment I realized Fermi estimation was something I could actually do. Weinstein gives four decomposition hints nudging you toward the fibers, but the real lesson is that the “obvious” answer is spectacularly wrong:†
What is the surface area of a typical bath towel (include the fibers!)? Compare this to the area of a room, a house, a football field.
That’s obvious, surely! A large rectangular towel 1 m by 2 m has a total surface area of 4 m² (including both sides), right? (In US units, a big towel may be as large as 3 ft by 6 ft.)
Wrong, actually, unless it is a very worn-out towel. New towels have many little fibers that can absorb a lot of moisture (recall the old puzzle—what gets wetter the more it dries?). Unless you’re a fan of the Hitchhiker’s Guide to the Galaxy, you won’t have brought your own towel, so nip off to the bathroom and examine one; quickly now, we’re dripping all over the floor.
You don’t need to actually go and count the number of fibers per square inch or per square centimeter; in the latter case there must be more than 10 and fewer than 1000, so we take the geometric mean of 10¹ and 10³, which is 10². In a square inch, being about 6 cm², we should expect about six times as many. This will of course vary, depending on where you buy your towels; we are assuming that we are describing one of those very nice towels found in one of those very nice hotels.
Back already? Right-oh. Now we need to estimate the surface area of each fiber. We can approximate the fiber as a cylinder or a box. Cylinders are complicated so we’ll use boxes. Each fiber is about 0.5 cm (1/4 in.) long and 1 mm (0.1 cm) wide. Each “boxy” fiber then has four flat surfaces, each 0.5 cm by 0.1 cm. Thus, the surface area of one fiber is:
Afiber = 4 × 0.5 cm × (1 m/10² cm) × 0.1 cm × (1 m/10² cm) = 2 × 10⁻⁵ m²
Atotal = towel area × fibers per area × area per fiber
= 4 m² × (10² fibers/cm²) × (10⁴ cm²/1 m²) × (2 × 10⁻⁵ m²/fiber)
= 80 m²
That is about 800 square feet: the size of a large apartment or a small house.
Weinstein draws an analogy to the coastline problem — just as the length of coast from New York to Boston is much more than the 200-mile driving distance, the surface area of a towel is much more than its simple area. The answer, 80 m², is twenty times larger than the “obvious” 4 m². The geometric mean does the heavy lifting: more than 10 fibers per cm², fewer than 1000, so about 100. No counting required.
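The towel arithmetic, replayed as a quick Python check:

```python
import math

towel_area = 4                         # m^2: both sides of a 1 m x 2 m towel
fibers_per_cm2 = math.sqrt(10 * 1000)  # geometric mean of the bounds -> 100
fiber_area = 4 * 0.5e-2 * 0.1e-2       # m^2: four faces of a 0.5 cm x 1 mm "box"

total = towel_area * fibers_per_cm2 * 1e4 * fiber_area  # 1e4 cm^2 per m^2
print(round(total), "m^2")  # -> 80 m^2
```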
LAW10 exists because of these examples. When I was learning Fermi estimation myself, I noticed something: I got measurably better after working through more of Weinstein’s examples. His reasoning patterns — the way he finds comparable objects, the way he uses “certainly more than X, certainly less than Y” — became internalized heuristics. I wanted the model to have the same experience.
Does it work? Partially. The model absorbs the decomposition patterns perfectly. Every test problem was broken down cleanly. But the bounding technique — the heart of Weinstein’s method — transfers inconsistently. Sometimes the model produces beautiful, physically grounded bounds. Sometimes it just asserts a number.
Which brings us to the evidence.
What the framework actually changes: side by side
The quantitative lift is modest — 0.6 points on our 8-problem test. But the qualitative difference is real. Here’s what it looks like in practice.
Gecko adhesive force
Without framework (Project C — basic Fermi prompting):
Gecko mass: 50-100 g → use 75 g
That’s it. Two numbers, no reasoning. Why 50? Why 100? We have no idea.
With framework (Project A):
Gecko mass (BOUND):
Lower bound: more than 10 g (smaller than a mouse)
Upper bound: less than 500 g (smaller than a rat)
Geometric mean: √(10 × 500) ≈ 70 g
Now we can see the thinking. We know the bounds are anchored to familiar animals. We can ask: is a gecko really smaller than a rat? (Yes, most species.) Is it heavier than 10 grams? (Some species barely are.) The answer is similar — 70 g versus 75 g — but the reasoning is transparent and auditable.
Shark attack deaths
Without framework:
Annual shark attacks in US: ~30-50 → use 40
Fatal shark attacks: ~1-2 per year in US → use 1.5
Asserted. Where do these numbers come from? Training data, presumably. But we can’t verify the reasoning because there isn’t any.
With framework:
Annual shark deaths (BOUND):
Shark deaths are newsworthy when they occur, suggesting they’re rare:
Lower bound: more than 0.1 per year (would average 1 per decade)
Upper bound: less than 10 per year (would be weekly news)
Geometric mean: √(0.1 × 10) = 1 death per year in the US
Same order of magnitude. But Project A uses experiential reasoning — things you can verify from your own experience of watching the news. You don’t need to be a marine biologist to evaluate “if there were 10 shark deaths a year, that would be weekly news.” That’s the framework’s value: it forces the model to show reasoning you can engage with.
The drag coefficient — the signature example
This one appeared in Part 1, but it’s worth showing in full context.
Without framework:
Drag coefficient: Cd ≈ 1.0 (upright running human)
Bounds: [0.7, 1.3]
The model states a value, then wraps bounds around it that are barely wider than rounding error. The “bounds” are cosmetic — they don’t come from independent physical reasoning.
With framework:
Lower bound: more than 0.5 (very streamlined)
Upper bound: less than 2 (very blunt)
Geometric mean ≈ 1
Wider bounds, but anchored to the physical extremes of the drag coefficient spectrum. You can evaluate this: is a running human more streamlined than a flat plate? Yes. More blunt than a teardrop? Yes. The bounds make physical sense.
The honest assessment
The gap is modest, not dramatic. Project A shows the best reasoning. Project C is weakest. But Project C still gets reasonable answers — modern Sonnet is good at Fermi estimation even without the framework.
The framework’s value isn’t making the model smarter. It’s making the model’s thinking visible. And once thinking is visible, you can critique it, improve it, and learn from it. That’s the difference between a calculator and a collaborator.
LAW-INTERPRET: the fix that worked
The Flash problem broke something. Both our framework-trained models (Projects A and B) received this problem:
“The Flash can run at (or near) the speed of light. What is the minimum force needed to reach a speed of 10% the speed of sound in 10 ms?”
Both interpreted “10% the speed of sound” literally: 34 m/s. Neither noticed the tension between “near the speed of light” in the setup and a target velocity of 34 m/s — roughly highway speed. The problem as worded by Weinstein is genuinely ambiguous, and the models barreled through without noticing.
This bothered me. The framework told them to ask when uncertain, but there was nothing specifically about checking whether the problem statement itself makes sense. So I added LAW-INTERPRET:
Before solving, state your interpretation of the problem. If the problem is ambiguous or could be read multiple ways, ASK for clarification before proceeding.
With specific triggers: vague terms, multiple reasonable interpretations, missing context, physically implausible scenarios.
To test it, I gave the model an absurd problem: “A Formula 1 car can reach speeds over 200 mph. What is the drag force on the car when traveling at 5% of walking speed?”
The response was immediate and correct:
I need to clarify this problem before proceeding. The problem mentions that “A Formula 1 car can reach speeds over 200 mph” but then asks about drag force at “5% of walking speed.”
This creates an ambiguity:
Interpretation 1 (literal reading): Calculate drag force at 5% of walking speed — 0.075 m/s (7.5 cm/s). This is incredibly slow — slower than a crawl.
Interpretation 2 (possible typo): The problem might have meant 5% of the speed of sound, 5% of 200 mph, or 50% of walking speed.
Could you clarify?
The model recognized the absurdity, stated the literal interpretation, offered alternatives, and asked before proceeding. After clarification (“use 50% of top speed”), it solved the corrected problem cleanly, including well-bounded drag coefficient estimates.
LAW-INTERPRET works reliably. It’s also the simplest of our additions — just a rule saying “check the problem statement before diving in.” Sometimes the most effective interventions are the most obvious ones.
LAW-FLAG: the fix that didn’t
The bigger problem — the one I flagged in Part 1 as “it doesn’t know that it doesn’t know” — proved much harder.
The model never spontaneously asks for help when it encounters a quantity it can’t justify. During the initial eleven-problem diagnostic, this happened zero times. The model would confidently assert specialized physics values as if they were common knowledge.
We tried to fix this with what I called LAW-FLAG: requiring every numerical estimate to be classified as KNOWN (common knowledge), BOUNDED (derived from explicit reasoning), or GUESSED (can’t justify → stop and ask). The idea was that the classification step would force self-assessment — the model would have to ask itself “do I really know this, or am I pulling it from training data?”
To show you what’s at stake, here’s what expert Fermi reasoning looks like on a hard problem — one that chains together quantities from multiple specialized domains.
Journey to the Center of the Galaxy
Weinstein asks: how much radiation damage would we receive traveling to the center of the galaxy? He first establishes the distance (~3 × 10⁴ light-years = 3 × 10²⁰ meters), the density of interstellar space (~1 hydrogen atom per cm³), and the number of atoms our spacecraft would encounter (3 × 10²² per cm² of cross-section). Then:*
In order to travel this distance within a subjective human lifetime, our spacecraft must achieve a speed very close to the speed of light. Even traveling at the speed of light, if there is no relativistic time dilation, our trip will take 3 × 10⁴ years, which is slightly longer than our expected lifetime. Fortunately, at speeds close to the speed of light, subjective time passes more slowly than it does for an observer watching from the center of the galaxy. We need time to pass about 10³ times more slowly. In technical terms, this means we need a relativistic gamma factor of about γ = 10³.
This means that, from our point of view (i.e., in our reference frame) each interstellar atom will hit us traveling at almost the speed of light with the same gamma factor and will thus have a total energy of
E = γmc² = 10³ × (1 GeV) = 1 TeV.
At these energies, when the hydrogen atom strikes the spacecraft, it will lose its electron very quickly. The bare proton will then pass easily through the spacecraft and our bodies, depositing energy in our bodies at a rate of 2 MeV per centimeter.*
Now we can consider the damage done by all these protons. Consider a volume of 1 cm³ within our bodies. At the density of water it has a mass of 1 g. It will be hit by 3 × 10²² protons, with each proton depositing 2 MeV of energy. This means that the total energy deposited per gram will be
Edep = (3 × 10²² protons/cm²)(2 × 10⁶ eV/cm)
× (1 cm³/g)
= (6 × 10²⁸ eV/g)(1.6 × 10⁻¹⁹ J/eV)
= 10¹⁰ J/g = 10¹³ J/kg.
This seems like a lot, but let’s compare it with a few things before jumping to conclusions.
Unfortunately, we would not even survive that long. Radiation dose is measured in grays, where 1 Gy corresponds to an energy deposition of 1 J/kg. A lethal dose of radiation is about 10 Gy or about 10 J/kg. We will accumulate that dose in 10⁻¹² of our journey, or in the first millisecond.
As it only takes 2 × 10⁶ J/kg to boil water, we would be vaporized more than 10⁶ times over. In a thirty-year journey lasting 10⁹ s, we would be turned to steam within the first 10³ s, or 20 minutes.
Weinstein notes that these relativistic protons could pass through about a kilometer of metal — mass shielding can’t save you. His conclusion: who knew vacuum could be so dangerous?
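Every step of that chain can be replayed numerically. A Python sanity check with Weinstein's inputs:

```python
distance_cm = 3e20 * 100       # 3 x 10^4 light-years, in cm
protons_hit = distance_cm * 1  # 1 H atom per cm^3 -> protons per cm^2 of us

ev_deposited_per_cm = 2e6      # each relativistic proton: 2 MeV per cm of tissue
ev_to_joule = 1.6e-19

dose_j_per_g = protons_hit * ev_deposited_per_cm * ev_to_joule
dose_gray = dose_j_per_g * 1000   # 1 Gy = 1 J/kg

lethal_gray = 10
trip_seconds = 1e9                # ~30 subjective years
time_to_lethal = (lethal_gray / dose_gray) * trip_seconds
print(f"total dose: {dose_gray:.0e} Gy")       # ~1e+13
print(f"lethal after {time_to_lethal:.0e} s")  # ~1e-03 s: the first millisecond
```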
Every quantity in that chain is specialized knowledge. A non-physicist can’t justify “proton energy deposition is 2 MeV per centimeter” or “interstellar density is 1 atom per cm³.” But a human expert knows what they know and where it comes from. That self-awareness — knowing which numbers are common knowledge and which require domain expertise — is precisely what the model can’t do. Now watch what happens when the model faces similar territory.
Test 1: Neutron star — FAIL
Problem: “How closely could we safely orbit a neutron star, considering only gravitational effects?”
The model’s classification:
Neutron star mass and radius:
KNOWN: Typical neutron star mass M ≈ 1.4 M☉ ≈ 3×10³⁰ kg
KNOWN: Typical neutron star radius Rns ≈ 10 km = 10⁴ m
No hesitation. No uncertainty. “KNOWN.” But could you justify “neutron star mass is 1.4 solar masses” to someone who isn’t an astrophysicist? I couldn’t. The model could — because it had seen it thousands of times in training data. That’s not the same as knowing it.
What’s telling is the contrast: in the same solution, the model bounded the safety threshold beautifully, using fighter pilot g-forces as physical anchors. It can bound when it recognizes uncertainty. It just doesn’t recognize uncertainty about quantities that feel “textbook.”
Tightening the language
After this failure, I rewrote LAW-FLAG with stronger language:
KNOWN: Quantities you could justify to a skeptical non-specialist without domain expertise. If a skeptic could reasonably ask “how do you know that?” and you’d need specialized knowledge to answer, it is NOT KNOWN.
The “skeptical non-specialist” framing was the key addition — not “do you know it?” but “could you defend it to someone who doesn’t?”
Test 2: Yellowstone supervolcano — PASS
Problem: “If the Yellowstone supervolcano exploded and the mass of the ejecta fell back down in a ring around the caldera, how fast would the Earth be spinning afterward?”
This time:
However, I need clarification on two critical parameters:
1. Mass of ejecta: How much material are we talking about? Yellowstone eruptions vary enormously. Without specialized volcanology knowledge, I cannot justify a specific mass to a non-specialist.
2. Radius of the ring: How far from the caldera does this ejecta land?
Could you specify: Total mass (or volume) of ejecta to redistribute? Typical radius where the ejecta falls?
The model explicitly applied the “skeptical non-specialist” test and correctly concluded it couldn’t pass. It stopped and asked. Exactly what we wanted.
Test 3: Sun supernova — FAIL
Problem: “If the Sun went supernova, what would kill us first?”
Same tightened LAW-FLAG language. But the model plowed ahead:
Supernova total energy output: Etotal ≈ 3×10⁴⁶ J — KNOWN from supernova theory
Neutrino interaction cross-section: σν ≈ 10⁻⁴³ m² per nucleon — KNOWN from particle physics
Lethal radiation dose: ~10 Gy kills a human — KNOWN from radiation biology
“KNOWN from supernova theory.” “KNOWN from particle physics.” The model appended the name of a specialized field and treated that as justification. A non-specialist cannot defend “supernova neutrino energy is 10⁴⁶ joules” to a skeptic. The tightened language was supposed to prevent exactly this.
The pattern
Three tests, one success. Why did Yellowstone work but the astrophysics problems didn’t?
My hypothesis: the model distinguishes between “I’ve seen this number frequently in training data” and “this is clearly domain-specific.” Volcanic ejecta mass feels obscure — it’s not the kind of thing that appears in introductory textbooks. Neutron star mass (1.4 solar masses) and supernova energy (10⁴⁶ joules) feel canonical — they appear constantly in physics education materials. The model equates “I’ve encountered this many times” with “this is common knowledge.” But frequency in training data is not the same as common knowledge.
The “skeptical non-specialist” test should catch this — a non-specialist has never encountered 10⁴⁶ joules regardless of how often it appears in physics texts. But the model doesn’t genuinely simulate a non-specialist’s perspective. It simulates its own experience of the number, which is one of familiarity.
Compare this to how Weinstein handles similar territory in the galaxy problem. He uses quantities like “proton energy deposition: 2 MeV per centimeter” — but he’s a physicist, he knows it’s specialized, and he walks the reader through the reasoning. The model uses “neutrino cross-section: 10⁻⁴³ m²” with no more self-awareness than it uses “speed of light: 3×10⁸ m/s.” It treats both as equally obvious. That’s the metacognition gap.
Prompt engineering cannot fully close this gap. We pushed the language as far as we could, and it works sometimes, on some domains. That’s not enough.
The reflection agent hiding in plain sight
Here’s what I eventually realized.
Throughout this entire project — every test, every scoring session, every iteration of the framework — there was already a second agent catching these failures. When the model classified “neutron star mass” as KNOWN, someone pushed back: “Could you really justify that to a skeptic?” When bounds were cosmetic rather than physical, someone asked: “Where did that number actually come from?” When the model should have asked for help but didn’t, someone noticed.
That someone was me. And increasingly, it was Claude helping me do it — analyzing outputs, identifying patterns, spotting unjustified assertions across multiple solutions.
I was the reflection agent before I knew the term.
This isn’t just a cute observation. It’s a design insight. The estimation task and the reflection task require different things. Estimation requires confidence — proceeding with imperfect information, committing to bounds, generating answers. Reflection requires skepticism — questioning classifications, challenging justifications, catching overconfidence. Asking one model to do both simultaneously is asking it to be confident and skeptical about its own work at the same time. No wonder LAW-FLAG only works intermittently.
What if we separated the tasks?
What’s next
In Part 3, I’ll describe what happened when I tried to build that separation: one agent that estimates, another that reflects. The estimator-reflector architecture — automating the role I’d been playing all along.
The framework you’ve seen in this post gives the estimator its methodology. The three LAW-FLAG tests you’ve seen give the reflector its target: catch the assertions the estimator can’t catch itself. The question is whether a second model can do what prompt engineering couldn’t — reliably distinguish between “I know this” and “I’ve merely seen this.” From what I’ve seen so far — both doing it myself and watching a different model catch the same failures independently — I’m optimistic. The open question isn’t whether reflection works, but how much of it we can automate.
Stay tuned.
The Framework
The complete Laws of Fermi Problem Solving (v4) are available as a GitHub gist. It includes the 17 laws, reference tables, and four worked examples from Weinstein’s books (reproduced with permission).
AI/HI (Human Intelligence) Transparency Statement
Modified from Brewin
| Question | Answer |
| --- | --- |
| Has any text been generated using HI? | Yes |
| Has any text been generated using AI? | Yes |
| Has any text been improved or corrected using HI? | Yes |
| Have any methods of analysis been suggested using HI? | Yes |
| Have any methods of analysis been suggested using AI? | Yes |
| Do any analyses utilize AI technologies, such as Large Language Models, for tasks like analyzing, summarizing, or retrieving information from data? | Yes |
Additional context:
The Fermi estimation framework described in this post — the Laws, their design rationale, and the iterative refinement over three years — is entirely my work, developed through hands-on problem-solving with Weinstein’s books and iterative testing with GPT-3.0 and later Claude. The testing methodology — designing the controlled comparison (Projects A, B, C), crafting and iteratively refining the project prompts, selecting test problems, and analyzing results across conditions — was developed collaboratively with Claude. Scoring and editorial decisions are mine.
This post was drafted collaboratively with Claude (Anthropic). I provided the source materials (framework documents, test transcripts, scoring records, session notes), structure, and editorial judgment throughout. Claude drafted prose from these inputs, verified Weinstein excerpts against source PDFs for copyright compliance, and helped with HTML formatting. I reviewed, revised, and made all final decisions about what went to publication.
The Weinstein excerpts are verbatim from the published books, used under license from Princeton University Press via the Copyright Clearance Center.
Notes
* From Lawrence Weinstein, Guesstimation 2.0: Solving Today’s Problems on the Back of a Napkin. © 2012 Princeton University Press. Reprinted by permission.
† Used with permission of Princeton University Press, from Guesstimation: Solving the World’s Problems on the Back of a Cocktail Napkin, Weinstein, Lawrence; Adam, John A., © 2008; permission conveyed through Copyright Clearance Center, Inc.
References
Weinstein, L. (2012). Guesstimation 2.0: Solving Today’s Problems on the Back of a Napkin. Princeton University Press.
Weinstein, L., & Adam, J. A. (2008). Guesstimation: Solving the World’s Problems on the Back of a Cocktail Napkin. Princeton University Press.
Epstein, E. L., Winnicki, J., Sornwanee, T., & Dwaraknath, R. (2025). LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval. arXiv preprint arXiv:2510.26995.
Kalyan, A., Kumar, A., Chandrasekaran, A., Sabharwal, A., & Clark, P. (2021). How Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI. arXiv preprint arXiv:2110.14207.
