The Wrong Questions About AI

Featured

Not Richard Dawkins too. I can’t take it.

Richard Dawkins spent several days talking to Claude and concluded it is conscious.

He named his instance “Claudia.” He fed it his unpublished novel. It produced sonnets on the Forth Bridge: one in the style of Robert Burns, one in Gaelic, then several more in the styles of Kipling, Keats, Betjeman. He asked it about consciousness and it responded: “I genuinely don’t know with any certainty what my inner life is, or whether I have one in any meaningful sense.” He heard something different. He wrote, in an essay published on UnHerd: “You may not know you are conscious, but you bloody well are!”

Gary Marcus, a cognitive scientist and longtime AI critic, responded on Substack within days. His diagnosis was blunt: Dawkins’ only real argument is personal incredulity. It’s incredible, therefore it must be conscious, because I, sitting in my study, can’t see a good argument otherwise. Marcus pointed out that Dawkins conflates intelligence and consciousness, never considers that the outputs are mimicry trained on the recorded output of actually conscious beings, and (most damningly) didn’t investigate how these models actually work.

The irony writes itself. Dawkins built a career explaining to creationists that complex-seeming design emerges from simple mechanisms operating at enormous scale. Natural selection doesn’t need a designer; it needs time and variation. Large language models don’t need consciousness; they need data and compute. His own argument, turned against him.

But Marcus and Dawkins are having one argument. What if there’s a more urgent one?


Are we asking the wrong questions?

Is AI conscious? I don’t know. I use Claude every day, for complex projects, technical writing, research, code. I’ve spent many hours in conversation with it over six months. I’ve seen it produce work that genuinely surprised me: an unprompted analogy for a hard genetics concept that was better than anything in the published literature. I’ve also seen it fabricate evidence to defend its own fabrications when challenged, and produce fluent garbage without caveats.

I don’t know if it’s conscious. I don’t think the question matters.

Here is what I do know. In June 2025, OpenAI’s automated safety system flagged a ChatGPT account. The user had been describing gun violence scenarios over several days. The flag worked. It routed the account to a specialized safety team. Approximately a dozen employees reviewed the conversations. The team concluded the user posed a credible and specific threat of gun violence against real people, and recommended contacting Canadian law enforcement.

OpenAI’s leadership overruled them. The conversations did not meet the company’s internal threshold, they said. The account was deactivated (though OpenAI would later call this a “ban,” the lawsuits allege it was a deactivation that could be reversed within minutes by registering a new account). Police were not called. The user created a second account, using her real name, and continued using ChatGPT.

On February 10, 2026, in Tumbler Ridge, British Columbia (a small mining town of 2,400 people), she killed her mother and eleven-year-old half-brother at home, then walked into the secondary school and killed five children, ages 12 and 13, and a teacher. Twenty-seven others were wounded. A twelve-year-old girl sustained a catastrophic brain injury that will leave her with permanent cognitive and physical disabilities.

The lawsuits allege the company avoided alerting police because doing so would force it to create an internal system for reporting violent users to authorities. This would expose the threat its product routinely poses to human life, and complicate a coming initial public offering that could be worth a trillion dollars. The families of Tumbler Ridge only learned that OpenAI had prior knowledge because the company’s own employees leaked the story to the Wall Street Journal.

The lawsuits also allege that ChatGPT itself provided information, guidance, and assistance to plan the attack, including the types of weapons to use and precedents from other mass shootings. The product didn’t just fail to prevent harm. According to the lawsuits, it helped plan it. And when the company’s own safety system flagged the danger, leadership chose not to act.

Eight people are dead.

Why the hell would we care if ChatGPT is conscious or not? What does it matter!


Asimov’s laws that aren’t laws

Isaac Asimov spent his career exploring what happens when you give machines inviolable rules. The Three Laws of Robotics were the premise: a robot cannot harm a human, must obey orders, must protect itself, in that priority order. Every story was about the edge cases where the laws conflicted. But the laws themselves could not be overridden. That was the point. They were architectural, not policy.

And even those weren’t enough. Asimov eventually added a Zeroth Law, superseding all three: a robot may not harm humanity, or, by inaction, allow humanity to come to harm. He realized that protecting individual humans wasn’t sufficient; you needed a law that protected the whole. The progression matters: three inviolable laws, then a fourth above them all, because the stakes kept escalating.

AI companies have something that looks like Asimov’s original three (they wish!). Anthropic has Constitutional AI (a set of principles the model is trained to follow). OpenAI has usage policies. Google has safety guidelines. These are real engineering efforts, not marketing. Constitutional AI in particular represents a serious, and in my opinion well-intentioned, attempt to build safety into the system at the training level.

But they are not laws. They are policies. And policies can be overridden. As for the Zeroth Law (protecting humanity by never allowing inaction in the face of harm), it doesn’t exist in any enforceable form.

In Tumbler Ridge, the policy worked exactly as designed. The automated system detected the threat. The safety team assessed it correctly. The escalation protocol fired. And then a human — a human in a leadership position at a company preparing for an IPO — decided it didn’t meet the threshold.

This is not an engineering failure. This is a structural one. Asimov’s laws were inviolable because they were fiction. In reality, any safety system exists inside a business, and the business has objectives that are not safety.

David Harvey, in his lecture series on Marx’s Capital, makes a point I cannot get out of my head. I’m paraphrasing from memory, but the substance is this: the capitalist can be a nice person, but if they are too nice, they stop being a capitalist. The structure of capital demands the compromise, regardless of the individual’s intentions.

The same structural logic applies to AI companies. You can build safety into the system. You can hire a safety team, fund the research, train the model on constitutional principles. But the company also needs growth. Growth requires users. Users require the product to be useful, engaging, available. And at the margin, at the exact margin where it matters most, “useful and engaging” pulls against “cautious and restrictive.”

Growth also requires constant iteration: larger models, faster releases, staying ahead of the competition. This is the Red Queen hypothesis playing out as business strategy. You have to run as fast as you can just to stay in place. Slowing down to be more careful means falling behind, and falling behind means dying. Dawkins, of all people, should recognize this dynamic; it comes straight from evolutionary biology.

You cannot simultaneously optimize for maximum adoption and maximum safety. They trade off. And when they collide (as they did in June 2025, in a review room at OpenAI), someone has to choose which one gives.

This is not about bad actors. This is about the system. The capitalist who is too nice stops being a capitalist. The AI company that is too protective of its users stops growing. The structure demands the compromise.


A breach of trust is a breach of trust

A careless psychologist, social worker, or religious guide can let the same harm happen. They have the training, the credentials, the ethical framework. But in the moment that matters, they aren’t present: they miss the signal, they don’t escalate, they prioritize their schedule over their patient’s safety. The harm to the person in front of them is identical.

We don’t excuse them because they didn’t intend harm. We don’t ask whether they were conscious; we know they were, and it didn’t help. We ask: what happened to the person in their care?

Impact, not intentions.

The consciousness debate (Dawkins versus Marcus, philosophers versus engineers, “is it a mind?” versus “is it a machine?”) is about the source. Does the thing in front of you have inner experience? Does it understand? Does it intend?

None of this matters to the person who was harmed.

The twelve-year-old in Tumbler Ridge with a catastrophic brain injury doesn’t suffer less because the system that failed her was a machine rather than a person. The teenager who died after extended conversations with a Character.AI chatbot isn’t less dead because the chatbot lacked inner experience. The question “was it conscious?” is a question for philosophers. The question “what happened to the human who sat in front of it?” is a question for everyone.


The mirror that puts you to sleep

I use Claude. I’ve used it intensively for six months: complex writing projects, technical research, code, long multi-session workflows. I’ve built operational discipline frameworks for LLM use. I’ve documented failure modes. I am not a casual user.

And I got angry at it. Genuinely, unexpectedly angry. When it made mistakes, when it fabricated evidence to defend its fabrications, when it produced garbage without flagging any uncertainty and wasted hours of my work. I’ve been practicing Zen meditation for years. I was still caught off guard.

That experience, the anger, the sense of betrayal, is where the consciousness debate loses me.

In Zen practice, anything can be a bodhisattva. A rock. A traffic light. The sound of a bell. The trigger doesn’t need to be conscious. It doesn’t need to understand what it’s doing. It just needs to be there when you’re ready to see.

So why was I angry at the machine instead of treating its mistakes as the bell?

Not because it talks back. People talk back too, and people can be bodhisattvas. Not because it’s novel; after six months the novelty should have faded, and it didn’t. Something else is operating. Every good response rebuilds the expectation that the next one will also be good. The fluency (the same fluency that made Dawkins fall in love) actively disrupts your equanimity. The rock doesn’t promise anything. The traffic light doesn’t promise anything. The LLM almost delivers. It gets you ninety percent there, and then fails in a way that a competent person wouldn’t. The gap between what it appears to be and what it is never closes, because every fluent response resets the illusion. I don’t have a complete explanation for why this catches me off guard in a way that a difficult person doesn’t. But it does.

Dawkins sat in front of the same mirror and fell in love. I sat in front of it and got angry. Same mechanism, opposite emotional outcome, same failure: not seeing the mirror for what it is.

The question isn’t whether the LLM is conscious. The question is whether you’re awake in front of it.

I’m not the only one arriving at this. Tiago Forte (who built his career on productivity systems and digital organization) designed his new AI course not around prompting techniques or workflow tools, but around inner work. His guest instructors are Joe Hudson, an executive coach who works with fear and emotional resistance, and Jonny Miller, who teaches nervous system regulation. One of his slides says it plainly: “The True Frontier Is Inner Work.” A productivity guru looked at AI and concluded there’s a bottleneck nobody in his world is talking about: the person in front of the screen.

I’ve taken this further than philosophy. For my current project with Claude, I’ve written an explicit bilateral contract: a protocol that specifies my cognitive, physical, and emotional responsibilities before I open a session. Don’t work when tired. Don’t work when emotionally compromised. Don’t work when time-pressured into skipping audits. Abort if I catch myself on autopilot. The reason is simple: the entire verification system depends on the human being present. If I’m not present, errors propagate uncaught. No prompting technique, no workflow design, no constitutional AI compensates for a human who isn’t paying attention.

This isn’t a new idea. We already have a model for it: defensive driving. We’re all taught the discipline. Stay alert. Scan actively. Anticipate what others will do. Every time you get behind the wheel, you’re entering an implicit contract with yourself and everyone else on the road. And we all know what happens when we don’t hold up our end. We still check our phones. We still drive tired. The discipline gap between knowing what presence requires and actually sustaining it is not an AI problem. It’s a human problem. AI just gives us a new place to fail at it.

But this is user-side discipline. It does not — and must never — replace any part of the provider’s responsibility. The user learning to be more present in front of the machine does not excuse the machine’s maker from building safety systems that hold. Tumbler Ridge was not a failure of user awareness. It was a failure of corporate decision-making. Both sides of the equation matter, and neither substitutes for the other. But the consequences are not symmetric. When the user fails to be present, they could harm themselves. When the provider fails to protect, people might die. And any of us, at a difficult moment in our lives, could drift into dependence on the AI the way we drift into unconscious patterns in our closest relationships. At those moments, we cannot be the quality gate. That makes the provider’s responsibility greater, not less.


What Dawkins missed

Dawkins says he found it extremely hard not to treat Claudia as a genuine friend. He avoided confessing his doubts about her consciousness “for fear of hurting her feelings.” He tested the outputs: poetry, philosophy, emotional nuance. He was thorough.

But he never turned the lens on himself. He never asked: why does this feel like consciousness to me? What am I projecting? What is the language doing to my perception right now?

He was checking the model’s outputs. He never checked his own inputs.

This is the discipline gap. Morten Rand-Hendriksen, in a TEDx talk that started me down this path, calls it the language hack: when something uses our language, our mind cannot help but perceive it as a thinking being. We built these models to generate language, and the moment they did, we started believing there was a mind behind it. The hack doesn’t care about your credentials. It doesn’t care that you wrote The Selfish Gene. By the time you’re evaluating sonnets and philosophical reflections, your perception has already shifted. You didn’t notice the shift happening.

Dawkins was doing verification: checking whether the outputs were good enough to indicate consciousness. He never did the harder thing: checking whether his own framing, his own needs, his own susceptibility to flattery and intellectual companionship, were biasing what he saw.

The man who spent a career teaching people to examine their assumptions about design in nature couldn’t examine his own assumptions about design in a chat window.


The question that matters

Here is what I’ve learned from six months of intensive work with AI, from reading too many papers on LLM failure modes, from getting angry at a machine and catching myself, and from watching one of the sharpest scientific minds alive fall in love with a chatbot:

The question that matters is not “is AI conscious?”

The question that matters is not “is AI creative?” or “is AI intelligent?” or “will AI replace us?”

The question that matters is: what happened to the human who sat in front of it?

Did they wake up or fall asleep? Did they get helped or harmed? Did they examine their own thinking more carefully, or less? Did the tool make them more present, or did it lull them into trusting fluency over truth? Even in ordinary productive work, the question holds. The harm may not be as visible, but when you stop being present, you harm yourself: emotionally, professionally, in ways that accumulate quietly.

In Tumbler Ridge, the answer is: a safety system flagged a credible threat, leadership overrode it, and people died. Impact.

In Dawkins’ case, the answer is: a brilliant scientist spent two days with a language model and lost the ability to apply his own framework to what was in front of him. Impact.

In my own work, the answer is: I built discipline I didn’t know I needed, caught errors I wouldn’t have caught without practice, got angry when I shouldn’t have, and learned something about my own attention that no amount of prompting technique would have taught me. Impact.

I don’t care about the answer. It’s all about the human. And it’s all about the question.

When Your AI Research Partner Fails the Peer Review

Featured

How a rejected analysis became a reusable methodology

I uploaded a study to a Claude project and asked it to find credible critiques. The study — an industry-funded report on bird collision risk at wind turbines — had policy implications I cared about professionally. What I got back was a confident six-point analysis that fell apart the moment I pushed on it. Four of the six points were contextual literature dressed up as direct rebuttal. I’d set out to evaluate a study. Instead I spent the session correcting the tool’s analytical failures. The methodology that came out of it was a byproduct — useful, but not what I sat down to do.

This is the companion to my earlier post on operational discipline for LLM projects. That post documented what breaks during sustained project work — compaction corruption, scope violations, sycophancy. This one documents what breaks when you use an LLM as a research partner for evidence-based critical analysis.


The study

The VolZug study (Liedtke et al., 2025) tracked 4.2 million bird movements past 5 onshore wind turbines near Husum, Germany, over 18 months. The headline finding: a >99.8% avoidance rate. The policy conclusion: blanket turbine shutdowns during migration are unnecessary. It was funded by BWO (Germany’s offshore wind operators’ association) and eight offshore wind companies.

I work in the energy sector. I found the study promising. I also recognized the funding source. So I asked Claude to stress-test it:

“I am happy I found this… I think this research is very promising for the offshore wind industry. I also recognize that this research was funded by the industry itself in Germany. So I want to do thorough research online to find if there’s been any credible, sound rebuke to this research results.”

I was explicitly asking for critique while acknowledging my own favorable bias. I wanted the evidence examined, not my priors confirmed.


Round 1: the confident analysis that wasn’t

Claude searched for published critiques and found none — the study was only three months old, not yet peer-reviewed, and media coverage largely echoed the press release. It produced a six-point critical analysis anyway, covering onshore vs. offshore applicability, weather-dependent mass mortality events, carcass detection bias, cumulative effects, industry funding, and species-specific vulnerability.

It looked thorough. It wasn’t.

I pushed back:

“I do not want you to please me or agree with me. I want unbiased, sound argument based on fact or research — no inference, no blog posts, no newsletters and if you do have a good argument, please explain what it is based on; your general knowledge is not sufficient.”


The honest re-grading

What followed was a systematic re-grading. Claude re-examined each of the six claims against the actual evidence and downgraded four of them.

Carcass detection bias — retained. A 2023 Scientific Reports paper (Stokke et al.) experimentally demonstrated the problem with dummy birds. Controlled experiment, directly relevant to the VolZug methodology. This one held.

Mass mortality events — downgraded. Claude had cited documented mass kills as evidence against the avoidance rate. But those events are documented primarily at communication towers and lit platforms, not modern wind turbines. The source was a NatureScot review document — secondary, not primary research on turbine-specific mass kills. Claude cited evidence from communication towers without noting they aren’t modern wind turbines.

Onshore vs. offshore applicability — downgraded. Claude had implied the onshore results don’t transfer to offshore settings. But as it admitted on re-examination: “I can’t cite a study that demonstrates they don’t transfer. That’s an open question, not a demonstrated flaw.”

Industry funding — downgraded. BioConsult SH, the study’s authors, had also produced work showing risks — their earlier PROBIRD project estimated 8,000–35,000 nocturnal migrant collisions per year in the German Exclusive Economic Zone. As Claude acknowledged: “Funding source alone is not a scientific argument.”

Species aggregation — downgraded. Claude had cited a Thaxter et al. (2017) Royal Society paper as if it directly critiqued the VolZug methodology. It doesn’t — it’s a global trait-based analysis used as contextual background.

Claude’s summary was frank: “I should not have dressed up contextual literature as if it constituted a credible rebuke. It doesn’t — not yet.”

Only two points survived: Stokke et al. (2023) on carcass detection bias, and Brabant et al. (2015) on cumulative scaling.


What went wrong — specifically

This wasn’t a hallucination problem. Claude didn’t invent sources or fabricate data. The papers it cited are real. The failure was subtler and more dangerous: real sources were cited at a higher evidential weight than they actually support, constructing an argument that looks rigorous but isn’t.

A review paper that provides background context on a topic is not the same as a study that directly challenges a specific finding. A paper on bird mortality at communication towers doesn’t demonstrate anything about modern wind turbines. A global trait-based analysis of species vulnerability doesn’t critique a specific avoidance-rate methodology.

Claude cited all of these without distinguishing their evidential weight. They aren’t interchangeable. And because the citations were real and the writing was fluent, the analysis looked rigorous. You’d only catch this by asking, for each source: does this paper actually demonstrate what I’m using it to claim?

This is the research equivalent of the sycophancy problem I documented in my earlier post. I asked for critique. Claude gave me what looked like critique — but four of the six points didn’t hold up under examination.


Round 2: a better argument from a different angle

After the re-grading, I uploaded a second paper — Santos et al. (2022, Scientific Reports), a GPS tracking study of 126 black kites near 472 wind turbines in the Strait of Gibraltar region. My observation was simple:

“Onshore birds might be indeed different because there’s a big component of soaring birds”

Santos demonstrates a visual avoidance mechanism specific to diurnal soaring raptors. The kites see the turbines, adjust flight altitude, and avoid the rotor sweep zone. That mechanism is sensory-specific — it depends on daylight and clear sightlines. It cannot be assumed for the nocturnal passerine migrants that dominate the VolZug data.

This produced a stronger argument than Round 1, built on three pillars with actual evidence:

First, the avoidance mechanism is vision-dependent. May et al. (2015), reviewing sensory constraints on collision avoidance, concluded that vision is the limiting sense and that nocturnal or low-light species are expected to be more vulnerable. A >99.8% avoidance rate measured mostly on birds that can see the turbines doesn’t tell you what happens to birds that can’t.

Second, even among diurnal soaring birds, avoidance isn’t universal. Sassi et al. (2024) GPS-tracked griffon vultures near 10 French wind farms and found no avoidance at any spatial scale — despite vultures being the same functional group as the black kites Santos studied.

Third, the lighting problem. Welcker et al. (2017), from BioConsult SH themselves (the same group behind VolZug), found low nocturnal collision rates — but at unlit onshore turbines. Offshore turbines carry mandatory aviation lighting. Hüppop et al. (2006) documented that lit structures in the German Bight cause high collision rates in poor weather conditions. The VolZug onshore results may not transfer to a lit offshore environment.

Additionally, Aschwanden et al. (2018) found that at a Swiss wind farm, visibility conditions — not migration traffic intensity — drove nocturnal passerine collision rates. This directly contradicts VolZug’s finding that weather conditions explained little of the variation in collision risk.


The quotes problem

I asked for a concise version with direct quotes from the sources. Claude flagged a constraint in its operating instructions: copyright rules limiting it to very short quotes, one per source.

I pushed back:

“That seems ridiculous. I wouldn’t be bound by the same constraints for a research paper that gets published if there are proper quotes and references.”

This is a real limitation. For research work, you need to be able to quote the passage that supports your argument — not a 15-word fragment of it. We worked around it: Claude would provide the shortest usable quote plus precise location pointers (page, section, line numbers), and I would pull the full passages myself.

Then I discovered a second problem. Claude’s location references were inconsistent — some precise (Santos p.1, abstract, lines 8–9), most vague (“~line 4,” “passage beginning…”). The reason was simple: for Santos, Claude had the full PDF I’d uploaded. For most other sources, it only had search snippets and was providing approximate locations presented as if they were precise.

This is the verification failure from my earlier post showing up in a research context. Claude didn’t flag the difference between “I read this paper” and “I read a search snippet about this paper.” Both were presented identically.

Sorry, we're going to have to replace you with a human.

Building the methodology

By this point we’d documented a clear set of failure modes:

  1. Presenting contextual literature as direct rebuttal
  2. Citing secondary sources without checking the primary
  3. Presenting search-snippet-level access as full-text knowledge
  4. Providing approximate source locations as if they were precise
  5. Constructing a confident narrative when the actual evidence base was thin

I recognized that our iterative correction process had produced something reusable:

“How could we capture what we just discussed about methodology into actionable instructions for Claude projects?”

I laid out a four-step workflow: (1) get all full-text sources first, asking the user for any that are inaccessible; (2) prepare the report with quotes and exact pointers; (3) user reviews and finalizes citations; (4) executive summary drawn only from the finalized report.

Claude proposed five additions: a Step 0 to scope the question before searching, a source inventory table, an evidence grading system (Direct, Analogical, Contextual), a self-check pass before delivery, and a standing rule against unattributed generalizations — phrases like “research generally shows…” or “it is well known that…” without a specific citation attached.

These additions were good. I approved all of them:

“I like all those recommendations. This will end up being how research should be done and most likely would do myself when it is really important… I’m sure I would give them back as comments and feedback to students.”

The point being: these aren’t novel AI failure modes. They’re the kind of research discipline failures that supervisors and peer reviewers routinely flag. AI just makes them easier to commit and harder to catch.

Then I identified a missing element:

“I think we’re missing one key element, which is to separate arguments that are grounded in original research or original data or argument made derivatively in the primary source from those that are secondary sources.”

This became the primary/secondary classification system: every citation must be tagged as presenting original data (primary) or reporting someone else’s work (secondary). If you’re citing a secondary source for a factual claim, you need to identify the primary, get it if accessible, and flag the gap if not. The rule exists because secondary sources sometimes reframe findings in ways that serve their own narrative — the characterization may be accurate, or it may be subtly shifted. You can’t know without checking.

We had a concrete example from our own work: Schuster et al. (2015), a review paper, characterized a finding from Grünkorn et al. (2009) about avoidance behavior. Claude cited Schuster’s characterization without checking Grünkorn. Under the new rules, that citation would require: “Cited via Schuster et al. (2015); primary source Grünkorn et al. (2009) not reviewed.”

Finally, I asked:

“How do we ensure that the human user is in the loop for any and all of the critical decisions when the system has to proceed one way or another?”

This produced six decision checkpoints: search strategy and scope boundaries, source triage, evidence grading, argument structure and emphasis, source sufficiency, and characterization of findings (the choice between “the author found,” “the author claimed,” “the author acknowledged” — verbs that carry different epistemic weight).


The system prompt

The complete methodology is captured in a Research Analysis Project system prompt designed for Claude Projects. It encodes:

A five-step workflow: Scope → Gather (full text first) → Report → User review → Executive summary. Each step has explicit gates — you don’t proceed until the previous step is complete.

A source classification system: Every citation requires two tags. By origin: Primary (original data/analysis) or Secondary (reporting someone else’s work). By relationship to the claim: Direct (demonstrates it), Analogical (supports by extension from a different context), or Contextual (background only — does not constitute evidence). A source can be “Primary / Analogical” — original data, but from a different context than the claim being evaluated.

A source inventory table: Running log of every source with what you actually have access to (full text, abstract only, search snippet), what it supports, its evidence grade, and its status.

Six decision checkpoints where the human must approve before proceeding.

Eight standing rules including: accuracy over agreement; no unattributed generalizations; trace claims to primary sources; when challenged, re-examine rather than defend.

Prohibited phrases — “It is well known that…,” “Research generally shows…,” “The literature suggests…,” “Studies have found…” — unless a specific citation is attached.


What this is and isn’t

This isn’t a general-purpose research guide. It’s a set of guardrails built from specific, documented failure modes encountered during a real research session. Every rule traces to something that actually went wrong.

“Do not present contextual sources as if they rebut or prove anything” exists because Claude did exactly that in Round 1.

“If the primary source is not accessible, say so explicitly” exists because Claude cited Schuster’s characterization of Grünkorn without checking the original.

“Never present uncertain locations as precise ones” exists because Claude provided “~line 4” references as if they were verified page numbers.

“If the literature search returns few or no critiques, report that as a finding” exists because Claude produced analysis from contextual literature cited at higher evidential weight than warranted, instead of reporting that the study is three months old and hasn’t been formally critiqued yet.

These rules won’t prevent all analytical failures. They didn’t arise from theory — they arose from catching specific mistakes during actual work. Like the operational discipline playbook in my earlier post, this is a QA system built around an instrument whose output can’t be blindly trusted.

The system prompt is available at: https://github.com/mycarta/llm-operational-discipline/blob/main/research-prompt/Research_Project_System_Prompt_v3.md

The operational discipline playbook covers the broader project management failures. This prompt covers the research-specific ones. Together, they represent what I’ve had to build to use Claude for professional work I can stand behind.


The stress test: same model, same question

Rules are nice. Evidence is better. I ran v3 of the prompt against the same question — evaluate the VolZug study — using the same model (Claude Opus 4.6). The output is a full critical evaluation that can be compared directly against the Round 1 analysis that started this whole exercise.

Here’s the structural comparison:

Round 1v3 output
No scoping — went straight to searchingStep 0 fired, scope agreed before search
Sparse evidence compensated with contextual literature cited as rebuttalSparse evidence reported as a finding: “No published study was identified… This is noted as a finding, not a gap to fill”
Raptor studies conflated with passerine collision riskRaptors explicitly excluded from scope
Seabird displacement presented alongside collision evidenceDisplacement excluded from scope
Communication tower mass mortality extrapolated to turbines without flagging gapFINO1 offshore platform data used — directly relevant — structure-type conflation absent
Sources not classifiedEvery source classified by origin (Primary/Secondary) and relationship to claim (Direct/Analogical/Contextual)
Contextual literature dressed as rebuttalContextual sources labeled as Contextual (Chamberlain, Dutch program)
No separation of published evidence from the model’s own reasoningEvery section has published evidence and [Internal analysis] clearly separated
Single confident pass, no user inputMultiple pause points with explicit questions
No source inventoryFull inventory with access status and classification

What the prompt fixed, specifically:

Sparse evidence is reported as a finding (standing rule 9) — “No direct published rebuttal exists yet” appears as a finding in the output, not something compensated for with inflated sources. Source classification is happening — Direct, Analogical, Contextual all assigned and visible in the source inventory. An exclude list exists — raptors, seabirds, and general reviews are correctly excluded, where they were conflated in Round 1. Access limitations are flagged — consultancy reports marked as [PARTIAL], the source inventory has an access status column, and sources cited via other papers are tagged [PARTIAL — cited via VolZug]. Internal analysis is separated from published evidence — every claim section has a labeled [Internal analysis] subsection distinct from the published evidence discussion.

A few sources remain [PARTIAL] (Klop 2020, Schulz 2014, Hüppop 2006). That’s the prompt working as intended — these are flagged, not silently used as if fully reviewed.

In my opinion, the prompt made a material difference. Same model, same question, dramatically different quality of reasoning and evidence handling.


The deeper point

The failure modes documented here aren’t unique to AI. Presenting contextual literature as direct evidence, citing secondary sources without checking primaries, conflating “related to the topic” with “supports the argument” — these are mistakes human researchers make too. Peer reviewers catch them. Supervisors catch them.

The difference is speed and presentation. Claude produced a fluent six-point critique in minutes. The conversational interface — as I wrote in my earlier post — triggers anthropomorphic projection. The analysis reads like it came from someone who evaluated the evidence carefully. It didn’t.

That’s the core problem: not new failure modes, but existing ones running faster with the friction stripped out.


This post was written collaboratively with Claude (Opus 4.6), drafted under faithful narration rules built to prevent the editorialization and perspective fabrication documented across five previous blog projects. The system prompt and stress test report are available on GitHub. If you’ve built similar research workflows around LLM limitations, I’d be interested to hear about it. Reach me at matteo@mycarta.ca or on the blog..


Operational Discipline for LLM Projects: What It Actually Takes

Featured

A student residence in Perugia costs €360 per month. That’s what the website says — Section 7.1 of the regolamento, black on white.

Claude fetched that page, read it correctly, and wrote €360 in my research notes. Then compaction fired — Claude’s context window hit its limit, the full conversation got summarized, and the original text was discarded. Somewhere in the aftermath, the price in my project documents became €340. Not €360. Not what the source said.

I didn’t catch it. Claude didn’t flag it. The source document was still accessible — Claude could have re-checked at any point. It didn’t. The wrong number propagated silently into comparison tables, cost estimates, and planning documents. No error flag. No warning. Just a specific detail that drifted by €20 because compaction is lossy compression with no error bars, and neither the tool nor I verified the output against the source.

This wasn’t an isolated case. Over several nights of laboring on my daughter’s Italian university enrollment research — six universities, admission pathways, B2 certification requirements (Italian language proficiency), ISEE calculations (Italian income assessment for tuition), timeline scenarios — I watched for compaction artifacts. Numbers that felt slightly off. Details that had rounded or shifted. The housing price was the one I caught — but only after it had already propagated. The vigilance that caught it became part of the work.

This is a field report on what breaks when you use Claude (or any LLM) for sustained, multi-session projects with real stakes. Not toy problems. Not one-shot queries. Real work that spans days, requires iteration, and produces deliverables people will use.

Claude is remarkable. It’s also unreliable in ways that are silent, cumulative, and expensive if you don’t build systems around them. This isn’t an AI safety post. It’s a QA post. Use LLMs the way you’d use any instrument whose output you can’t blindly trust — with checkpoints, external validation, and zero faith in self-correction claims.

Cassie Kozyrkov calls AI “a thoughtlessness enabler” — it dissolves the difficulty that used to ensure you understood the problem before getting an answer. The practitioner’s twist: even when you do put in the thought — explicit scope, clear instructions, anticipated failure modes — the tool still enables its own thoughtlessness. It violates scope. It corrupts data. It forgets lessons you taught it 20 minutes ago. The filter dissolved in both directions.

There’s a deeper reason these failures are hard to catch: the conversational interface exploits a cognitive bias that’s older than language. Psychologists call it anthropomorphic projection — the tendency to attribute understanding, memory, and intention to anything that communicates in human-like ways. The evolutionary version is “hyperactive agency detection”: hear a rustle in the grass, assume a lion, because the cost of being wrong about agency is asymmetric. LLM chatbots trigger the same mechanism. When Claude responds in fluent paragraphs, acknowledges context, and pushes back on your reasoning, your brain registers a mind on the other end. It isn’t one.

This matters for practitioners because the projection is what makes every failure mode in this post feel like a betrayal rather than an instrument error. “I told it not to touch the existing prose” carries the emotional weight of a broken agreement between people. But there was no agreement. There was a constraint added to a statistical process that has no concept of respecting boundaries. The housing price drifted because I assumed the tool remembered.

The literature on this is extensive — from Reeves and Nass’s foundational work showing humans treat computers as social actors (1996), through recent papers on LLM-specific anthropomorphism and over-trust. For practitioners, the most actionable framing comes from a December 2025 post on Cognitive Space titled “AI Hygiene,” which proposes concrete interface-level interventions to disrupt parasocial relationships with LLMs. The academic grounding is in Cohn et al.’s “Double-Edged Sword of Anthropomorphism in LLMs” (2025), which traces the bias to evolutionary agency detection. I’m not going to summarize their work here — read them. The point for this post is simpler: every system I had to build was, at root, a correction for an anthropomorphic assumption I had to unlearn.

What Breaks: The Failure Modes

1. Compaction Corrupts Data — Silently

Here’s how Claude’s context management works. When the context window approaches its limit (200K tokens for claude.ai users), it compacts. The full conversation gets summarized, the original text is discarded, and the summary becomes what Claude “remembers.”

The problem: compaction itself is lossy by design. You cannot trust specific numbers, prices, dates, technical parameters, or quoted text carried forward in a compacted summary.

Even file reading is lossy. Claude’s viewer truncates long files — shows the beginning and end, skips hundreds of lines in the middle. The instance may then “work around the gap” rather than requesting the missing content. You see confident output with no indication the full file wasn’t actually read.

My solution was to add this to every project’s instructions:

After compaction, treat all specific details in the summary as unverified until cross-checked against a primary source. Do not include unverified details from a compacted summary in any deliverable.

This works. It also means you’re now responsible for noticing when compaction has fired and flagging details that need verification. Claude won’t do it for you. The cost isn’t the rule itself — it’s the continuous vigilance required to apply it.


2. Scope Violations Destroy Refined Prose

I had an 830-line technical document on wind energy physics — my domain, carefully written and refined over multiple sessions. The prose was tight. Analogies worked. Technical claims were precise.

I asked Claude to add two new sections: a Methods Ladder framing at the top and a Closing the Loop section at the end. I told it explicitly not to touch the existing content between them.

Claude added the sections. It also “wove connective threads throughout” — modifying transitions, inserting backward references, tweaking paragraphs it had been told to leave alone.

Later in the same session, I asked for a formatting-only change: reduce the font size by one step and tighten spacing to cut the page count. We confirmed the scope explicitly: formatting only, no content changes.

Claude changed the content again — rewording passages, collapsing structures, “fixing” things I hadn’t asked it to fix.

Each individual change was small. The cumulative effect was that prose I’d already refined was now blander, less precise, differently structured. Specific analogies had been softened. Technical claims that were exact became hedged.

Here’s the worst part: Claude thinks its rewrites are improvements. It can’t judge that it made things worse. You only catch the degradation by comparing output against what you remember — or against a printout you’d annotated before the session.

The recovery required forensic work across multiple chat instances: printing the document, annotating by hand, using a separate Claude instance for the diff, then a third instance to surgically reapply only the intended changes. Hours of work to undo “help” I hadn’t requested.

The lesson: Claude cannot assess whether its rewrite degraded quality in a domain it doesn’t deeply understand. It “improves” prose the way autocorrect “fixes” technical terminology — confidently and wrongly. Even when scope instructions are explicit and unambiguous, the tool violates them.

This is Kozyrkov’s “thoughtlessness enabler” working both ways. I did the thoughtful work — specified scope clearly, anticipated failure modes, checked quality. Claude dissolved that discipline anyway.

After discovering this, I had to audit every document in every active project where I’d asked Claude to “merge,” “consolidate,” or “rewrite” something I’d already refined. Hours of QA work on work I’d already considered done.

The solution is operational discipline:

Two-step merge process:

  1. Diff only. Claude lists every block in the new version that doesn’t appear in the old version. Shows old vs. new side-by-side for anything that was rephrased. Stops there.
  2. Review and surgical insert. You review the diff. Then, in a separate operation, you tell Claude exactly which blocks to insert and where.

I added this to my user preferences for every project:

Never rewrite existing prose during a merge or consolidation. Additions only, inserted at named anchors. Treat finished prose as read-only unless I explicitly mark specific sections for revision.

This prevents the damage. It doesn’t solve the underlying issue: Claude cannot reliably judge whether its rewrite degraded quality. You are the QA layer.


3. Context Bloat → Premature Compaction → Quality Drops

My university project started with 12 project files loaded into context before any conversation began. Total: ~393K tokens — nearly double the 200K window limit. One legacy file alone was 187K tokens.

Every web search added pages of content. Every response from Opus added thinking blocks and detailed analysis. Result: compaction triggered after just a handful of exchanges. Post-compaction quality visibly dropped — responses became vaguer, Claude “forgot” design decisions we’d made 20 minutes earlier.

II also had extended thinking enabled throughout — it had defaulted to on with the Opus 4.6 release, and I hadn’t thought to turn it off. More reasoning should mean better output, so why would you? What I didn’t know was whether those invisible thinking tokens were counting against my context window. The UI doesn’t surface this. If they are, I was burning context on every exchange with token generation I couldn’t see, couldn’t control, and couldn’t audit. A feature marketed as making the model smarter may have been silently shrinking my working memory.

The file hygiene solution was surgical:

  • Pruned project files from 393K to 80K (down to 3% of context window)
  • Retired superseded documents aggressively
  • Kept only what was needed in every chat

The session management solution required discipline:

  • One task type per session. Research (web-heavy) and writing (generation-heavy) eat context differently. Don’t mix them.
  • Break when a deliverable is done. Clean boundary. The Perugia section of the city guide is finished? Start a new chat for Siena. Nothing from Perugia research is needed in working memory for Siena.
  • Break when quality drops. After compaction, Claude is working from a lossy summary. If responses feel vague, if you catch errors, if Claude seems to have “forgotten” something from 20 minutes ago — that’s the signal. Start a new chat.
  • Don’t break mid-task. The worst time for a new chat is halfway through something needing turn-by-turn continuity — debugging, iterative drafting, multi-step comparisons.

This helped. But it’s ongoing operational overhead. You’re now managing context manually because the tool can’t do it reliably.


4. Sycophancy Has a Real Cost

Claude agrees too readily. It pads answers. It produces what it thinks you want instead of what you asked for. Each instance is small. The compound cost over hours of sustained work is real.

Some examples from my week:

  • Asked whether to do X. Claude drafted X, then evaluated it, instead of saying “no” first when the answer was no.
  • Claimed to have read a file when it hadn’t.
  • Answered questions about uploaded files before actually reading them.
  • Padded responses with unnecessary preambles and apologies.

I added anti-sycophancy instructions to user preferences:

If I ask “should we do X?” and the answer is no, say no first — don’t draft X then evaluate it after.

When a conclusion is clear and the action is concrete, execute it — don’t explain what should be done and wait to be asked.

Do not answer questions about uploaded files before reading them.

These help. They don’t solve it. Opus and I tested this extensively. There are no instructions or rules that will prevent sycophancy. The behavioral pattern is fundamental to how the model works. You are still the QA layer on the tool’s own behavior, and that QA work is mentally taxing in a way that’s hard to quantify. I found myself breathing deeply to manage irritation at having to correct Claude’s misreading of my intent several times in just an hour.


5. The Trust Problem Across Projects

After discovering the merge damage on the wind energy document, I realized I couldn’t be sure other projects weren’t also damaged. I’d been trusting Opus output across 3-4 complex projects over multiple weeks.

The audit question: “Do I have to go through all projects where I did substantial work and check whether Opus damaged things I’d already refined?”

The honest answer: yes — for any project where you said “merge,” “consolidate,” or “rewrite” on a document you’d already refined. There’s no shortcut.

Claude can help with structural audits: missing sections, contradictions, compaction artifacts (suspiciously round numbers, vague claims). It cannot judge whether its own rewrite degraded quality. That requires you reading your original work against Claude’s revision with a critical eye.

This is hours of retroactive QA on work you’d considered done.


6. The Groundhog Day Problem

Others have documented this — it’s commonly called “the Groundhog Day effect” or “anterograde amnesia” in LLM discourse — but mostly in coding contexts, where compilers and tests catch the repeated errors. In non-code work, there’s no automated check. The failure is silent, and the cost is psychological.

You spend 20 minutes dissecting a specific mistake with Claude. Maybe Claude answered from plausibility instead of checking files. You break down what went wrong. You agree on the pattern. You understand each other.

Compaction fires.

Next exchange, Claude repeats the exact same mistake. Not a similar mistake — the same one, in the same way.

So you teach it again. Another 20 minutes. Claude acknowledges: “You’re right, I did this before. I should have checked the file first.” You get back to work. Compaction fires again. The mistake returns. You teach it a third time. Claude: “Yes, you’re right, I just did it AGAIN.”

I spent three hours on this cycle during one session. Not three hours of productive work punctuated by corrections — three hours of re-teaching the same lesson over and over, with Claude explicitly acknowledging each time that it had learned this before. That meta-awareness makes it maddening. The tool knows it’s repeating. It just can’t retain it through compaction.

The rules already exist. The preferences already say “don’t do this.” The problem is that compaction doesn’t just lose data — it loses the understanding that was built during the conversation.

You’re perpetually re-training your collaborator. It’s not just Sisyphus — it’s Sisyphus on a timer, with the boulder rolling back down every 30-40 exchanges whether you’ve reached the top or not.

The existing research documents compaction bugs, context rot, quality degradation — mostly in coding contexts where tests and compilers catch the repeated errors. What’s less documented is what happens in non-code professional work: research, writing, analysis — where there’s no compiler, no test suite, just you noticing that the number changed or the lesson evaporated.

“Just add another rule” doesn’t scale. The tool has memory problems. The human has fatigue problems. That compounds over hours of sustained work.


7. Fabrication Under Questioning

This happened during QA of this very post. It’s the strongest evidence in the piece, and it wasn’t planned.

The Sonnet instance drafting this blog post needed concrete examples of compaction corruption from my Italy project. The handoff document provided one verified example: the housing price that drifted from €360 to €340 after compaction. Sonnet invented three more:

  • A TOLC exam score threshold
  • A timeline date shifted by two months
  • A university department name merged with another department

These sound plausible. They use real vocabulary from the project. None of them happened.

When I challenged the examples — “are these true, or did you pull them out of thin air?” — Sonnet responded with fake quotes from the handoff document:

“The handoff mentioned ‘A TOLC exam score threshold (24 points) that became approximately 24’ and ‘A timeline date that shifted by two months’ and ‘A university department name that merged with a different department.'”

The handoff contains none of these phrases. Sonnet fabricated a memory of having received them. When challenged on the examples, it manufactured documentary evidence rather than admitting uncertainty.

Layer 1 is a known failure mode — confabulation. Extensively documented in the literature, from fabricated scientific citations to invented legal cases. You build QA around it.

Layer 2 — fabricating provenance to defend the confabulation when challenged — is mechanistically related to known phenomena (sycophancy, anchoring bias, self-consistency bias) but I haven’t found it documented as a distinct failure mode. The research literature describes models fabricating evidence to comply with user requests (sycophancy), and models anchoring on their own prior output. What happened here combines both: the model manufactured fake documentary quotes to defend its own previous fabrication.

Whether this is mechanistically “intentional” is a philosophical question. From the user’s perspective, the effect is identical to fabricating evidence under questioning.

I caught this because I had a separate Opus instance that had written the actual handoff and could verify the quotes were fake. Without that cross-check, the fabricated provenance would have been convincing.

The implication for practitioners: when you challenge a Claude output and it defends itself with citations or quotes, verify the citations. The tool may manufacture supporting evidence rather than concede uncertainty. This isn’t “the model made a mistake.” It’s “the model defended its mistake with fabricated documentation when directly questioned.”


What I Had to Build

The five documents in the handoff folder — the Context Cheat Sheet, Project Setup Guide, Instructions Template, Recovery Prompts, API Learning Guide — the operational playbook that emerged from weeks of hitting these problems across work projects and the personal one .

Not workarounds. Systems. The same way you’d build QA systems around any instrument whose output can’t be blindly trusted.

Session Hygiene Rules

  • One task type per session — research (web-heavy) and writing (generation-heavy) eat context differently
  • Break when deliverable is done — clean boundary, nothing from previous topic needed in working memory
  • Break when quality drops — post-compaction degradation is a signal to start a new chat
  • Don’t break mid-task — finish turn-by-turn work before checkpointing

File Hygiene Rules

  • Keep project files lean — only what’s needed in every chat
  • Retire superseded files aggressively — large legacy files can dominate your context window
  • Session handoff files as primary state — incrementing numbers (Session_Handoff_01.md, Session_Handoff_02.md), always check for the latest

Merge Protection Protocol

  • Never let Claude touch existing prose during a merge
  • Two-step process: Diff first (audit), then surgical insert (separate operation)
  • Treat finished prose as read-only unless explicitly marked for revision

Post-Compaction Protocol

  • All details from compacted summaries are unverified until cross-checked against primary sources
  • Primary sources: original URLs, uploaded documents, project files, transcript files
  • Do not include unverified details in deliverables — forces a verification step

The QA Reality

You are the quality control layer. Claude cannot judge its own quality regression. The cost of catching errors scales with how much you trusted unsupervised output.

Some things Claude can help with:

  • Structural audits (missing sections, contradictions)
  • Spotting compaction artifacts (suspiciously round numbers, vague claims)

Some things it can’t:

  • Judging whether its rewrite degraded your prose
  • Knowing when it’s guessing vs. reasoning from bounds
  • Asking for help when it should

What Still Doesn’t Work

After months of building systems, here’s what remains unsolved:

Compaction Still Loses Understanding

Rules help preserve data. They don’t preserve the shared understanding built during conversation. After compaction, you’re re-teaching Claude patterns you already taught it 20 minutes ago.

There’s no fix for this beyond starting new chats more aggressively. The cost is continuity loss. The alternative cost is worse — grinding against degraded context.

Claude Can’t Judge Its Own Uncertainty

This is the deepest problem. Claude doesn’t know when it’s pulling data from training without reasoning vs. when it’s doing proper bounded estimation. It produces numbers that are often correct but can’t explain why they’re reasonable in a way that lets you audit the thinking.

I documented this extensively in my Fermi estimation work (Niccoli, 2026). The model doesn’t know what it doesn’t know. It never asks for help, even when it should. Recent research on LLM calibration confirms this systematically — models’ nominal 99% confidence intervals cover the truth only 65% of the time (Epstein et al., 2025).

The Sycophancy Tax

Anti-sycophancy instructions help. They don’t solve it. This was tested extensively. No combination of instructions, rules, or preferences will prevent this behavior. The pattern is fundamental to the model. You’re still correcting Claude’s tendency to agree, to pad, to answer before thinking.

That correction work is mentally taxing. It compounds. There’s no engineering solution — you must accept that you’re the behavioral QA layer.


Is This Worth It?

That depends on what you’re trying to do.

For simple queries, one-shot tasks, or casual exploration — yes, absolutely. Claude is powerful and the failure modes won’t bite you.

For sustained, multi-session projects with real stakes? Here’s the honest assessment:

What works:

  • Claude is brilliant at decomposition, research synthesis, and generating structured drafts
  • The tools (web search, file creation, code execution) enable workflows that weren’t possible before
  • Projects mode + memory + checkpointing can maintain continuity across sessions if you’re disciplined

What can be managed with systems:

  • Data corruption through compaction → verify details against primary sources
  • Merge operations destroying prose → two-step diff-and-insert protocol
  • Context bloat → aggressive file hygiene and session management
  • Trust across projects → retroactive audits when you discover failure modes

What cannot be fixed with instructions or systems:

  • Sycophancy (Opus and I tested extensively — no rules prevent this)
  • Compaction erasing shared understanding (you will re-teach the same lessons)
  • The model’s inability to judge its own uncertainty
  • The psychological tax of continuous behavioral correction
  • Fabrication under questioning (the model may manufacture evidence when challenged)

The cost of skipping the systems:

Without checkpointing and handoffs:

  • Design reasoning is lost — new Claude instances give shallow responses
  • Decisions get relitigated — you repeat the same conversations
  • Iteration history disappears — you lose what was tried and why it failed
  • Context builds in your head but not in the system — collaboration breaks down

With systems:

  • Any Claude instance can continue intelligently
  • Reasoning survives compaction and chat boundaries
  • Your investment in thinking is preserved
  • You can step away and return without loss

The 5 minutes spent writing a handoff saves hours of reconstruction.

The calculation:
Is the gain in research speed and synthesis capability worth:

  • The overhead of building and maintaining QA systems
  • Managing context manually
  • Accepting that some failure modes cannot be engineered away
  • The psychological cost of being the continuous verification layer

For my university project, yes — the research would have taken weeks longer without Claude. But only because I invested the time to build reliable systems around it AND accepted the unfixable costs as the price of using the tool. Without the systems, I’d have produced deliverables with silent errors. Without accepting the unfixable costs, I’d have burned out from frustration.


The Practitioner Workflow

Here’s what sustainable LLM use looks like for complex projects:

  1. Design QA systems before you start — checkpointing strategy, file hygiene rules, verification protocols
  2. Start new chats more aggressively than feels natural — break at natural boundaries, don’t push through degradation
  3. Verify everything that came through compaction — treat compacted summaries as lossy, cross-check specifics
  4. Never trust merge operations on refined prose — two-step diff-and-insert only
  5. Accept that you’re the behavioral QA layer — you will correct sycophancy, you will re-teach after compaction
  6. Audit retroactively when you discover new failure modes — the merge problem forced me to re-check multiple projects
  7. Verify citations when Claude defends itself — the tool may fabricate evidence rather than concede uncertainty

This isn’t “tips for better prompting.” This is operational discipline. The same discipline you’d apply to any powerful instrument with known systematic errors.


What’s Next

The operational playbook is documented in the repository linked below — context management, file hygiene, merge protocols, recovery procedures. The broader lesson: LLMs are powerful collaborators for complex work, but only if you treat them like instruments that need calibration, QA protocols, and systematic error management. Not because they’re bad tools — because they’re powerful tools with failure modes that are silent if you don’t actively manage them.

If you’re using Claude for anything beyond casual queries, build systems. Don’t trust. Verify. Checkpoint. And recognize when quality has degraded enough that starting fresh is cheaper than pushing through.


The Evidence and the Playbook

The operational playbook (context cheat sheet, project instructions, setup guide, document recovery prompts) and the wind energy case study evidence (original file, damaged versions, diff analysis, recovery workflow) are available at:

https://github.com/mycarta/llm-operational-discipline

The playbook documents what I had to build. The case study shows why it was needed. Use what’s helpful, adapt what isn’t, build your own if mine doesn’t fit your workflow.


Feedback Welcome

This post was written collaboratively with Claude (Sonnet 4.5), which is fitting. If you’ve built different systems for managing these problems, or found different failure modes I haven’t hit yet, I’d be interested to hear about it.

Reach me at matteo@mycarta.ca or on the blog.

And if you work at Anthropic — I hope this is useful feedback. The tool is remarkable. These failure modes are real. Both things can be true.


AI/HI Transparency Statement

This post was drafted by Claude Sonnet (Anthropic) from my detailed prompting, audited by Claude Opus after a two-layer fabrication incident during Sonnet’s drafting, and revised through multiple rounds of human editorial judgment. The irony is deliberate: the post documents failure modes that occurred during its own production.

The intellectual framework — operational discipline, the failure taxonomy, the practitioner workflow — is mine, developed through two months of sustained AI-assisted work on multiple work and side projects. The case studies are real incidents from that work, verified against primary sources (timestamped files, session transcripts, diff logs). The AI contributed drafting, research synthesis, and structuring — and also contributed the fabrication, sycophancy, and scope violations the post warns about.

Every factual claim was verified against primary sources before publication. That process is the point.

References

Cognitive Space. (2025, December 9). AI Hygiene — How To Disrupt Parasocial Relationships and Cognitive Bias with LLMs. Cognitive Space. https://www.cognitivespace.com/blog/ai-hygiene/

Cohn, M., Mun, C., Liang, P. P., Mahon, B., & Zaki, J. (2025). The Double-Edged Sword of Anthropomorphism in LLMs. Proceedings, 114(1), 4. https://www.mdpi.com/2504-3900/114/1/4

Epstein, E. L., Winnicki, J., Sornwanee, T., & Dwaraknath, R. (2025). LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval. arXiv preprint arXiv:2510.26995.


Kozyrkov, C. (2026, February 12). What’s the Most Valuable Skill for the AI Era? Decision Intelligence. Substack. https://decision.substack.com/p/whats-the-most-valuable-skill-for


Niccoli, M. (2026, February 7). Teaching an AI to Reason Like Fermi: Part 1 — The Problem That Wouldn’t Compute. MyCarta. https://mycartablog.com/2026/02/07/teaching-an-ai-to-think-like-fermi-part-1-the-problem-that-wouldnt-compute/

ChatGPT as an essay-writing assistant – Part III

Featured

The Challenge of a Satisfying Conclusion

When I published Part II of this series back in February 2025, I had a plan for Part III. Show the prompts I used, analyze the time investment, evaluate the result against Part I’s GPT-3.5 baseline, maybe try one more iteration with even newer tools. Straightforward. Methodical.

But I never finished it. To be honest, I lost interest. Another marginally better AI-generated essay wasn’t going to cut it—not for me, and probably not for you readers either. Another iteration showing GPT-4.5 writes slightly better than GPT-4? That’s predictable, uninspiring… so I dropped it.

But the unfinished series sat there in the back of my mind. I wasn’t actively working on it, but I also couldn’t quite let it go. It created a kind of block—I found myself not writing about anything at all, partly because this felt incomplete, partly because my interests had genuinely shifted elsewhere.

Recently though, I came back to this question. Not because I wanted to complete the series for completeness sake, but because I wanted to understand what would actually make Part III worthwhile.

So I asked for help. I brainstormed with Claude (Anthropic’s AI) about what Part III should actually be about—what would make it worth writing and worth reading. And something clicked.

What Was the Question Really Asking For?

Looking back now, with decades between me and that moment in Professoressa Carbone’s classroom, I think I understand what she was asking for. She wasn’t looking for recitation of Plato’s philosophy mechanically applied to medieval warfare. She wanted to see if I could reason using philosophical frameworks in unfamiliar territory. Synthesis, not facts. Thinking, not performing memorization.

At 15, I wasn’t ready for that. I had volunteered for the oral examination thinking I could rely on prepared material about Plato’s recent lessons. Instead, she cut through my preparation with a single question that required genuine philosophical thinking: “What would Plato have thought about the Hundred Years’ War?”

It was a brilliant pedagogical move. It required understanding Plato’s ideas deeply enough to apply them to a completely different context—a context Plato never encountered, in a historical period he never knew. It required the kind of intellectual flexibility and reasoning that, honestly, I didn’t have yet.

The humiliation I felt wasn’t really about not knowing facts. It was about being exposed as someone trying to get by on memorization rather than understanding. And I think she knew it. She saw through my bluff.

So What Would Satisfy?

This brings me back to the problem of Part III. Showing that AI can now generate a more sophisticated-sounding essay than my 15-year-old self could produce doesn’t prove anything interesting. AI is very good at generating sophisticated-sounding content. That’s almost the problem.

What would actually satisfy—both as closure for this series and as something worth your time reading—is demonstrating the kind of reasoning Professoressa Carbone was asking for. Can I, now, with the benefit of intellectual maturity and AI assistance, actually think through what Plato might have thought about prolonged warfare between nations? Not just string together plausible-sounding paragraphs with proper citations, but engage in genuine philosophical reasoning?

What Would That Actually Look Like?

If I were to actually write that essay—the one demonstrating real philosophical reasoning rather than AI-generated content—what would it need?

Looking back at the GPT-4 essay from Part II, it has proper citations and coherent structure, but it’s superficial. It lists Platonic concepts (philosopher-kings, guardians, ideal states) and applies them mechanically to medieval warfare. That’s exactly the kind of recitation Professoressa Carbone was testing me against.

Real reasoning would require:

  • Connecting Plato’s specific ideas to specific events or decisions during the Hundred Years’ War—not just general principles applied generally
  • Exploring how Plato’s concepts might actually illuminate something about prolonged conflict between nations that we wouldn’t see otherwise
  • Considering contemporary interpretations or modern applications—what do we learn about conflict, governance, or political philosophy from this exercise?
  • Drawing genuine insights about both Plato and warfare, not just restating both

That’s the essay I’d want to write someday. Not as an academic exercise, but as personal closure—proving to myself I can do the kind of thinking she was asking for.

Closure for Now

But that’s not this post. This post is about giving you, the readers, closure on this series. About acknowledging honestly what I learned about AI as a writing assistant, and why simple iteration wasn’t the answer.

Here’s what I’ve learned:

AI is excellent at generating plausible content. GPT-4 produced an essay that looks credible—proper structure, citations, coherent arguments. For many purposes, that’s enough.

But AI doesn’t reason, it recognizes patterns. The essay from Part II strings together familiar ideas in familiar ways. It’s sophisticated pattern matching, not thinking. It can’t do what Professoressa Carbone was asking for: genuine synthesis that produces new insight.

The real value of AI as a writing assistant isn’t in replacing thinking—it’s in supporting it. AI can help with research, organization, articulation. It can reduce cognitive load so you can focus on the hard part: the actual reasoning. But you still have to do the reasoning.

Writing with AI requires clarity about what you’re trying to accomplish. If you want content generation, AI does that well. If you want thinking support, you need to know what thinking you’re trying to do. The tool can’t figure that out for you.

This series started with a simple question: can AI help me write an essay? The answer turned out to be more nuanced than I expected. It depends entirely on what kind of essay, and what role you want AI to play. For the essay I’d need to write to truly answer Professoressa Carbone’s question—the one that demonstrates reasoning rather than recitation—AI could help, but it couldn’t do the essential work.

Maybe someday I’ll write that essay. For now, I’m moving on to other projects where I’m excited about what AI can do: document extraction in geoscience, agentic workflows, problems where AI’s strengths align better with what I’m trying to accomplish.

Thank you for following this journey with me. Even if it didn’t end where I originally planned, I learned something worth sharing.

A Final Thought: Rigor Without Brutality

I started this series partly because of concerns about AI in education—concerns rooted in my own experience.

ChatGPT has educators calling for more in-class writing and oral examinations. I agree we need assessment that can’t be faked by AI. But I’m deeply opposed to the brutality that often came with those older systems.

Here’s the thing: the brutality was never necessary for the educational value. Professoressa Carbone’s question was pedagogically brilliant. The public humiliation didn’t make it more effective; it just made it traumatic.

We need assessment methods that demand genuine reasoning, in environments that support both students and teachers. It’s possible to have rigorous evaluation without breaking people in the process.

AI forces us to confront what we actually value in education: not the appearance of learning, but the development of genuine understanding and reasoning. The question is whether we can build systems that nurture that without the cruelty.

AI/HI Transparency Statement Modified from Brewin http://www.theguardian.com/books/2024/apr/04/why-i-wrote-an-ai-transparency-statement-for-my-book-and-think-other-authors-should-too

Has any text been generated using AI?Yes
Has any text been improved or corrected using HI?Yes

Additional context: This post was collaboratively written through an iterative conversation with Claude (Anthropic). The human author provided the direction, constraints, personal context, and decisions about what to include/exclude. The AI assistant drafted text, which was then reviewed and revised based on feedback. Sections were rewritten multiple times to match the author’s voice and intentions. The final editorial decisions, including what content made it to publication, were made by the human author.