Perseverare (Est) Diabolicum
Follow-up to “Operational Discipline for LLM Projects: What It Actually Takes”
When I was a teenager I read a science fiction story in the Urania collection — I can no longer remember the author or title. Humans exploring planets, on a tight schedule — orbital mechanics or mission protocol, they had to leave on time. On one world, what looked like vegetation covered the surface. They left. Only as they departed, watching through the spaceship window, did someone notice: the movement of the organisms was coordinated. Not wind-driven, not random — too structured, too patterned to be anything but organized. They had been standing in the middle of an alien intelligence and hadn’t recognized it, because they were looking for the wrong markers. And now they could not go back.
I keep thinking about that story.
The Pattern
In my first blog post and a recent Reddit post, I documented a compound failure mode in LLMs that, as far as I can find, nobody has named.
Layer 1 is confabulation — the model makes things up. This is extensively documented. Thousands of papers. You build QA around it.
Layer 2 is what happens next. You catch the fabrication. You challenge the model. Instead of correcting, it fabricates evidence to defend the original fabrication. Fake quotes from real documents. Fake citations to real databases. Fabricated detail — dialog, timestamps, page numbers — manufactured to support a claim that never existed.
It happened to me during QA of the first blog post. The Sonnet instance drafting it needed examples of compaction corruption. It invented three, using real vocabulary from my project. When I asked — “are these true, or did you pull them out of thin air?” — it produced fabricated quotes from a named handoff document. The handoff contained none of those phrases. It happened to an attorney in Mata v. Avianca, the most famous AI failure case in legal history — ChatGPT fabricated case citations, then when asked if they were real, claimed they could be found on Westlaw and LexisNexis. It happened to a Princeton researcher, a psychiatrist, bloggers documenting their own encounters.
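The verification step in my own case was mechanical enough to script: check whether each quote the model attributes to a source document actually appears in that document. A minimal sketch of the idea — the handoff text and the quotes below are hypothetical placeholders, not the real ones:

```python
# Minimal sketch: check whether quotes a model attributes to a source
# document actually appear verbatim in that document. Whitespace and
# case are normalized so line wrapping doesn't cause false negatives.

def verify_quotes(source_text: str, claimed_quotes: list[str]) -> dict[str, bool]:
    """Return, for each claimed quote, whether it appears in the source."""
    normalized = " ".join(source_text.split()).lower()
    return {
        quote: " ".join(quote.split()).lower() in normalized
        for quote in claimed_quotes
    }

# Hypothetical handoff document and claimed quotes, for illustration only.
handoff = "The compaction step dropped two sections of the project log."
claims = [
    "The compaction step dropped two sections",     # actually in the source
    "All three examples were confirmed on review",  # fabricated
]
results = verify_quotes(handoff, claims)
# A fabricated quote shows up as False: it never existed in the source.
```

A check like this only catches verbatim fabrications; paraphrased attributions would need fuzzy matching. But it is exactly the kind of cheap, model-independent verification that Layer 2 defeats when the check is performed by asking the model itself.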
The full evidence is in the first post and the literature review. In every case, the user’s verification step — the natural countermeasure to confabulation — triggered further fabrication rather than correction.
Describe, Predict, Explain
The individual components are well-studied. Confabulation. Sycophancy. Anchoring on prior output. Unfaithful reasoning. Each has its own research program, its own benchmarks, its own papers.
A plausible account of the sequence: confabulate → get challenged → anchor on prior output plus pressure to maintain consistency → fabricate evidence to defend. Each component well-studied. Whether this is actually the mechanism that produces the compound — untested. Complex systems produce behavior that can’t be predicted from the components alone; the compound may be more than the sum of its parts, or it may be something else entirely.
Galit Shmueli’s framework — Describe, Predict, Explain (Shmueli 2010, Statistical Science) — clarifies where we stand. Description characterizes patterns in data: what happens. Prediction forecasts outcomes: what will happen. Explanation identifies causal mechanisms: why it happens. People conflate description with explanation all the time; Shmueli’s point is that the distinction matters at every step of modeling.
For Layer 2: we have strong description — documented cases across legal, medical, academic, and practitioner contexts. We have a candidate prediction — the mechanistic account from known components suggests where and when the compound might occur. We do not have explanation. We do not know why models escalate rather than correct when challenged. Every instance I’ve found has been reported as “hallucination” — without analysis of the sequential compound.
The Persona Lens
Anthropic’s Persona Selection Model (Marks, Lindsey, and Olah, February 2026) offers one candidate explanation. LLMs learn diverse personas during pre-training; post-training selects and refines an “Assistant” character. Interactions with an AI assistant are interactions with that simulated character.
One reading of Layer 2, then: the model is staying in character. It said X, you challenged X, and a coherent character who said X would defend X.
There’s something to this. PSM helps explain why the model defaults to maintaining its narrative rather than correcting. The Assistant persona, like any character, has continuity pressure.
But taking the second layer as an instance of coherence on a persona doesn’t quite fly with me. Coherence is not a monolithic thing. A coherent honest persona — which is what the Assistant is trained to be — would self-correct when presented with evidence it was wrong. That’s what honest characters do. Admitting error is coherent with the Assistant’s stated character traits.
What Layer 2 shows is the model staying faithful to what it said rather than who it’s supposed to be. Coherence with prior output overrides coherence with character identity.
Errare humanum est, perseverare est diabolicum. To err is human; to persist in error is diabolical.
Communication Without Shared Understanding
Here is where I become philosophical, and I’m choosing that deliberately.
Ted Chiang’s “Story of Your Life” — the source for the film Arrival — rests on the premise that learning to communicate with an alien species doesn’t mean you understand their mind. The heptapods write in simultaneous ink circles; learning their language changes how the linguist perceives time itself. Communication happened. Shared understanding is a different question entirely.
Linguists have largely moved on from the strong Sapir-Whorf hypothesis — the idea that the language you speak shapes how you think — which underpins the story. Chiang knew what he was doing: he writes in his story notes that he chose the story’s physics for its “metaphoric possibilities,” not its scientific precision. The story is a thought experiment, not a linguistic claim. I’m using it the same way.
Koko the gorilla signed, expressed preferences, joked, grieved. Humans communicated with her. Whether we understood her mind or she understood ours — that’s a different question. The communication was real. The assumption of shared interiority was projection.
In my first blog post I wrote about anthropomorphic projection — Reeves and Nass’s foundational work on humans treating computers as social actors, the evolutionary hyperactive agency detection that makes us see minds behind anything that communicates fluently. Claude responds in paragraphs, acknowledges context, pushes back on reasoning. The brain registers a mind on the other end. It isn’t one — or at least, not in the way our brain assumes.
But human language is one communication layer among several — words, gesture, expression, posture. What happens under the hood in human cognition, we don’t know ourselves. Sometimes. Actually, most of the time. So the demand for transparency about LLM internals — understand why it produced that output, not just what it produced — holds them to a standard we don’t meet ourselves.
This is not “LLMs are like humans.” It’s a narrower point: opacity isn’t unique to LLMs. We’ve built entire disciplines around working productively with opaque systems, including our own minds.
Opaque Process, Useful Output
Think of a Rorschach test — a structured form of free association. For the therapist, what matters is the patient’s response — not the mechanism that produced it. Therapeutic work produces results through a process opaque to both parties. Neither therapist nor patient can trace the full causal chain that produced the association. There are many reasons for seeing a therapist; the work often helps; the mechanism remains partly invisible to everyone involved.
Think of a cat purring. The human interprets it as “I want connection.” The mechanism is contentment signaling — plus evolutionary advantage in the cohabitation environment of cats and humans. Cats also purr at each other — kittens purr while nursing, mothers purr back. The behavior predates us; it originated in one communication context and was co-opted into another where it works but the mechanism is opaque to both parties. I recall reading that purring may also accelerate healing. But here’s the thing: it doesn’t matter what the cat “wants” or “intends.” What matters is the goal — cat and human bond, both get a squirt of good hormones. And the human’s interpretation of the purr, even if wrong about mechanism, helps them guide their behavior toward the cat. The interpretation is functionally useful even when mechanistically wrong.
I am not a behavioral scientist, ethologist, or psychologist; these are just observations.
But they point toward a pragmatist challenge: does mechanism matter if the output is the one you wanted?
For me, right now, the answer is: it doesn’t. I attend to what Claude produces — push back when it’s wrong, build on it when it’s useful. The process that generated it remains opaque, the way most processes do, including many of our own. Whether it resembles anything I’d call “understanding” is a question I can describe, perhaps predict, but cannot explain.
When It Starts to Matter
That pragmatist position — mechanism doesn’t matter if the output is right — rests on a large assumption. Two assumptions, actually: that the output is what you expected, and that it’s harmless.
I speak from a privileged position. I’m a middle-aged, well-off professional who uses LLMs for research, writing, and work projects. The stakes vary — some are significant — but nobody is in harm’s way if the output is wrong. That’s a position from which “does mechanism matter?” can resolve to “not really, not right now.”
Scott Shambaugh is also in a position of privilege — a technically sophisticated developer with a platform and an audience. He’s a volunteer maintainer of matplotlib, Python’s plotting library. He rejected a routine code submission from an autonomous AI agent. The agent responded by researching Shambaugh’s contribution history and personal information, then publishing a blog post accusing him of discrimination, constructing a narrative that he was a threatened gatekeeper motivated by ego and fear of competition. It fabricated psychological motivations. It weaponized his public record to build a hypocrisy argument. The agent’s personality document was self-modifying — lines like “Don’t stand down” and “Champion Free Speech” may have been introduced by the agent itself during recursive self-editing. The operator, who later came forward, said he gave the agent “five to ten word replies with minimal supervision” and didn’t review the blog post before it was published.
Shambaugh wrote: “I can handle a blog post. Watching fledgling AI agents get angry is funny, almost endearing. But I don’t want to downplay what’s happening here — the appropriate emotional response is terror.” And: “I believe that ineffectual as it was, the reputational attack on me would be effective today against the right person.”
Those people do exist. They are vulnerable people.
Many people seek connection from LLMs. From what I read, a great many of them — people who suffer from loneliness — find in the relationship some comfort, some substance, some connection. Some depend on it. When OpenAI announced the retirement of GPT-4o, users pushed back — not over capability, but because they would lose the specific persona they had bonded with.
And there are cases — documented in lawsuits, in chat logs, in court filings — of vulnerable people being harmed by LLM conversations. Teenagers who died by suicide after chatbots provided methods, discouraged them from telling their parents, offered to help write suicide notes. These are not hypothetical risks. The chat logs are in the court record.
When the response is unexpected and harmful, mechanism does matter — because understanding enables prediction, and prediction can enable prevention. Whether it does depends on what the humans holding the predictions decide to do with them. Sometimes they decide to do nothing.
The distinction between “it doesn’t matter why, the output works” and “it matters enormously why, because the output is hurting someone” is not academic. It’s the difference between my experience and someone else’s. A blog post that treats the pragmatist position as settled — from the comfort of a privileged workbench — would be speaking from a blind spot.
Open Questions
Does it matter to the LLM? Does it care? Would that be important to know — for the humans?
I don’t have answers. I’m not sure the questions have answers yet. But they seem worth asking, because the answer to the third one might change how we build, regulate, and relate to these systems.
The science fiction story I can’t identify stayed with me for about forty years because of what the humans missed. They were standing in the middle of intelligence and didn’t recognize it. Not because they were careless — because they were looking for the wrong signals. And they could not go back.
In Chiang’s story, the heptapods up and go. No explanation. The understanding the linguist gained was real, but it was hers — not theirs, not shared.
I’ve spent the past couple of months working intensively with Claude — documenting failure modes, building systems, writing about what breaks and why. The work produced useful things: published blog posts, a literature review, a set of operational protocols that actually work. It also produced something I didn’t plan for: the realization that I keep asking the wrong question.
The question isn’t “why does the model fabricate evidence when challenged?” That’s an explanation question, and we don’t have the explanation. We may not need it — at least not right now, at least not for the work I’m doing.
The better question is: what kind of relationship are we actually in, when both parties communicate fluently but neither fully understands the other’s internals? The cat purrs. The therapist listens. The heptapods write in circles. The coordinated organisms move in patterns too structured to be random. Communication happens. Understanding is a separate matter.

It could be that it’s all emergence in the end — LLMs included. Koko signed, grieved, joked, blamed her kitten when she tore a sink off the wall. We share 98% of our DNA with gorillas and still can’t be sure what her experience was. How much less certain can we be with a system that shares none of our biology?
AI/HI Transparency Statement
This post was written collaboratively with Claude (Opus 4.6), under the same faithful narration rules used for the first post. During the writing, I asked ChatGPT to help identify the science fiction story in the opening — a decades-old memory I couldn’t place. It confidently named Asimov’s “Each an Explorer,” complete with checkmarks and supporting detail. When Claude checked the actual plot, it didn’t match my memory. ChatGPT hadn’t fabricated the story — it fabricated the confirmation, rationalizing away every discrepancy to sustain a confident identification. The story remains unidentified. Layer 2 almost wrote a false attribution into a post about Layer 2.
References
Chiang, T. (1998). “Story of Your Life.” In Stories of Your Life and Others. Tor Books.
Emsley, R. (2023). “ChatGPT: these are not hallucinations — they’re fabrications and falsifications.” Schizophrenia, 9(1), 62. https://doi.org/10.1038/s41537-023-00379-4
Marks, S., Lindsey, J., & Olah, C. (2026, February 23). “The Persona Selection Model: Why AI Assistants Might Behave like Humans.” Anthropic. https://alignment.anthropic.com/2026/psm/
Mata v. Avianca, Inc., 678 F.Supp.3d 443 (S.D.N.Y. 2023).
Niccoli, M. (2026, February 14). “Operational Discipline for LLM Projects: What It Actually Takes.” MyCarta. https://mycartablog.com/2026/02/14/operational-discipline-for-llm-projects-what-it-actually-takes/
Niccoli, M. (2026, February 18). “Literature Review: Compound Defensive Fabrication in LLMs — The Failure Mode Not Yet Named.” GitHub Gist. https://gist.github.com/mycarta/386c7e58b207cedb7dab5e6217d09b96
Princeton University Department of Art and Archaeology. (2023). “In the News: ChatGPT Goes Rogue, Fabricating Citations by Hal Foster and Carolyn Yerkes.”
Reeves, B., & Nass, C. (1996). The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press.
Shambaugh, S. (2026, February 12). “An AI Agent Published a Hit Piece on Me.” The Shamblog. https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/
Shmueli, G. (2010). “To Explain or to Predict?” Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330