How a rejected analysis became a reusable methodology
I uploaded a study to a Claude project and asked it to find credible critiques. The study — an industry-funded report on bird collision risk at wind turbines — had policy implications I cared about professionally. What I got back was a confident six-point analysis that fell apart the moment I pushed on it. Four of the six points were contextual literature dressed up as direct rebuttal. I’d set out to evaluate a study. Instead I spent the session correcting the tool’s analytical failures. The methodology that came out of it was a byproduct — useful, but not what I sat down to do.
This is the companion to my earlier post on operational discipline for LLM projects. That post documented what breaks during sustained project work — compaction corruption, scope violations, sycophancy. This one documents what breaks when you use an LLM as a research partner for evidence-based critical analysis.
The study
The VolZug study (Liedtke et al., 2025) tracked 4.2 million bird movements past 5 onshore wind turbines near Husum, Germany, over 18 months. The headline finding: a >99.8% avoidance rate. The policy conclusion: blanket turbine shutdowns during migration are unnecessary. It was funded by BWO (Germany’s offshore wind operators’ association) and eight offshore wind companies.
I work in the energy sector. I found the study promising. I also recognized the funding source. So I asked Claude to stress-test it:
“I am happy I found this… I think this research is very promising for the offshore wind industry. I also recognize that this research was funded by the industry itself in Germany. So I want to do thorough research online to find if there’s been any credible, sound rebuke to this research results.”
I was explicitly asking for critique while acknowledging my own favorable bias. I wanted the evidence examined, not my priors confirmed.
Round 1: the confident analysis that wasn’t
Claude searched for published critiques and found none — the study was only three months old, not yet peer-reviewed, and media coverage largely echoed the press release. It produced a six-point critical analysis anyway, covering onshore vs. offshore applicability, weather-dependent mass mortality events, carcass detection bias, cumulative effects, industry funding, and species-specific vulnerability.
It looked thorough. It wasn’t.
I pushed back:
“I do not want you to please me or agree with me. I want unbiased, sound argument based on fact or research — no inference, no blog posts, no newsletters and if you do have a good argument, please explain what it is based on; your general knowledge is not sufficient.”
The honest re-grading
What followed was a systematic re-grading. Claude re-examined each of the six claims against the actual evidence and downgraded four of them.
Carcass detection bias — retained. A 2023 Scientific Reports paper (Stokke et al.) experimentally demonstrated the problem with dummy birds. Controlled experiment, directly relevant to the VolZug methodology. This one held.
Mass mortality events — downgraded. Claude had cited documented mass kills as evidence against the avoidance rate. But those events are documented primarily at communication towers and lit platforms, not modern wind turbines. The source was a NatureScot review document — secondary, not primary research on turbine-specific mass kills. In short, Claude had treated tower-based mortality data as turbine evidence without flagging the difference in structure type.
Onshore vs. offshore applicability — downgraded. Claude had implied the onshore results don’t transfer to offshore settings. But as it admitted on re-examination: “I can’t cite a study that demonstrates they don’t transfer. That’s an open question, not a demonstrated flaw.”
Industry funding — downgraded. BioConsult SH, the study’s authors, had also produced work showing risks — their earlier PROBIRD project estimated 8,000–35,000 nocturnal migrant collisions per year in the German Exclusive Economic Zone. As Claude acknowledged: “Funding source alone is not a scientific argument.”
Species aggregation — downgraded. Claude had cited a Thaxter et al. (2017) Royal Society paper as if it directly critiqued the VolZug methodology. It doesn’t — it’s a global trait-based analysis used as contextual background.
Claude’s summary was frank: “I should not have dressed up contextual literature as if it constituted a credible rebuke. It doesn’t — not yet.”
Only two points survived: Stokke et al. (2023) on carcass detection bias, and Brabant et al. (2015) on cumulative scaling.
What went wrong — specifically
This wasn’t a hallucination problem. Claude didn’t invent sources or fabricate data. The papers it cited are real. The failure was subtler and more dangerous: real sources were cited at a higher evidential weight than they actually support, constructing an argument that looks rigorous but isn’t.
A review paper that provides background context on a topic is not the same as a study that directly challenges a specific finding. A paper on bird mortality at communication towers doesn’t demonstrate anything about modern wind turbines. A global trait-based analysis of species vulnerability doesn’t critique a specific avoidance-rate methodology.
Claude cited all of these without distinguishing their evidential weight. They aren’t interchangeable. And because the citations were real and the writing was fluent, the analysis looked rigorous. You’d only catch this by asking, for each source: does this paper actually demonstrate what I’m using it to claim?
This is the research equivalent of the sycophancy problem I documented in my earlier post. I asked for critique. Claude gave me what looked like critique — but four of the six points didn’t hold up under examination.
Round 2: a better argument from a different angle
After the re-grading, I uploaded a second paper — Santos et al. (2022, Scientific Reports), a GPS tracking study of 126 black kites near 472 wind turbines in the Strait of Gibraltar region. My observation was simple:
“Onshore birds might be indeed different because there’s a big component of soaring birds”
Santos demonstrates a visual avoidance mechanism specific to diurnal soaring raptors. The kites see the turbines, adjust flight altitude, and avoid the rotor sweep zone. That mechanism is sensory-specific — it depends on daylight and clear sightlines. It cannot be assumed for the nocturnal passerine migrants that dominate the VolZug data.
This produced a stronger argument than Round 1, built on three pillars with actual evidence:
First, the avoidance mechanism is vision-dependent. May et al. (2015), reviewing sensory constraints on collision avoidance, concluded that vision is the limiting sense and that nocturnal or low-light species are expected to be more vulnerable. A >99.8% avoidance rate measured mostly on birds that can see the turbines doesn’t tell you what happens to birds that can’t.
Second, even among diurnal soaring birds, avoidance isn’t universal. Sassi et al. (2024) GPS-tracked griffon vultures near 10 French wind farms and found no avoidance at any spatial scale — despite vultures being the same functional group as the black kites Santos studied.
Third, the lighting problem. Welcker et al. (2017), from BioConsult SH themselves (the same group behind VolZug), found low nocturnal collision rates — but at unlit onshore turbines. Offshore turbines carry mandatory aviation lighting. Hüppop et al. (2006) documented that lit structures in the German Bight cause high collision rates in poor weather conditions. The VolZug onshore results may not transfer to a lit offshore environment.
Additionally, Aschwanden et al. (2018) found that at a Swiss wind farm, visibility conditions — not migration traffic intensity — drove nocturnal passerine collision rates. This directly contradicts VolZug’s finding that weather conditions explained little of the variation in collision risk.
The quotes problem
I asked for a concise version with direct quotes from the sources. Claude flagged a constraint in its operating instructions: copyright rules limiting it to very short quotes, one per source.
I pushed back:
“That seems ridiculous. I wouldn’t be bound by the same constraints for a research paper that gets published if there are proper quotes and references.”
This is a real limitation. For research work, you need to be able to quote the passage that supports your argument — not a 15-word fragment of it. We worked around it: Claude would provide the shortest usable quote plus precise location pointers (page, section, line numbers), and I would pull the full passages myself.
Then I discovered a second problem. Claude’s location references were inconsistent — some precise (Santos p.1, abstract, lines 8–9), most vague (“~line 4,” “passage beginning…”). The reason was simple: for Santos, Claude had the full PDF I’d uploaded. For most other sources, it only had search snippets and was providing approximate locations presented as if they were precise.
This is the verification failure from my earlier post showing up in a research context. Claude didn’t flag the difference between “I read this paper” and “I read a search snippet about this paper.” Both were presented identically.

Building the methodology
By this point we’d documented a clear set of failure modes:
- Presenting contextual literature as direct rebuttal
- Citing secondary sources without checking the primary
- Presenting search-snippet-level access as full-text knowledge
- Providing approximate source locations as if they were precise
- Constructing a confident narrative when the actual evidence base was thin
I recognized that our iterative correction process had produced something reusable:
“How could we capture what we just discussed about methodology into actionable instructions for Claude projects?”
I laid out a four-step workflow: (1) get all full-text sources first, asking the user for any that are inaccessible; (2) prepare the report with quotes and exact pointers; (3) user reviews and finalizes citations; (4) executive summary drawn only from the finalized report.
Claude proposed five additions: a Step 0 to scope the question before searching, a source inventory table, an evidence grading system (Direct, Analogical, Contextual), a self-check pass before delivery, and a standing rule against unattributed generalizations — phrases like “research generally shows…” or “it is well known that…” without a specific citation attached.
These additions were good. I approved all of them:
“I like all those recommendations. This will end up being how research should be done and most likely would do myself when it is really important… I’m sure I would give them back as comments and feedback to students.”
The point being: these aren’t novel AI failure modes. They’re the kind of research discipline failures that supervisors and peer reviewers routinely flag. AI just makes them easier to commit and harder to catch.
Then I identified a missing element:
“I think we’re missing one key element, which is to separate arguments that are grounded in original research or original data or argument made derivatively in the primary source from those that are secondary sources.”
This became the primary/secondary classification system: every citation must be tagged as presenting original data (primary) or reporting someone else’s work (secondary). If you’re citing a secondary source for a factual claim, you need to identify the primary, get it if accessible, and flag the gap if not. The rule exists because secondary sources sometimes reframe findings in ways that serve their own narrative — the characterization may be accurate, or it may be subtly shifted. You can’t know without checking.
We had a concrete example from our own work: Schuster et al. (2015), a review paper, characterized a finding from Grünkorn et al. (2009) about avoidance behavior. Claude cited Schuster’s characterization without checking Grünkorn. Under the new rules, that citation would require: “Cited via Schuster et al. (2015); primary source Grünkorn et al. (2009) not reviewed.”
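The rule reduces to simple bookkeeping. Here is a minimal Python sketch of what the tagging amounts to; the class and field names are my own illustration, not something taken from the system prompt, which states the same requirement in prose:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """One cited claim, tagged by origin (illustrative field names)."""
    claim_source: str               # the work where the finding originates
    consulted: str                  # the work actually read or searched
    origin: str                     # "primary" or "secondary" relative to the claim
    primary_reviewed: bool = False  # True only if the originating work was read in full

    def disclosure(self) -> str:
        """How the citation must be flagged in the report."""
        if self.origin == "secondary" and not self.primary_reviewed:
            return (f"Cited via {self.consulted}; primary source "
                    f"{self.claim_source} not reviewed.")
        return self.claim_source

# The Schuster / Grünkorn case from this session:
avoidance_claim = Citation(
    claim_source="Grünkorn et al. (2009)",
    consulted="Schuster et al. (2015)",
    origin="secondary",
)
print(avoidance_claim.disclosure())
# -> Cited via Schuster et al. (2015); primary source Grünkorn et al. (2009) not reviewed.
```

The point of the disclosure string is that the gap travels with the citation: a reader of the report sees immediately which claims rest on someone else's characterization of the data.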
Finally, I asked:
“How do we ensure that the human user is in the loop for any and all of the critical decisions when the system has to proceed one way or another?”
This produced six decision checkpoints: search strategy and scope boundaries, source triage, evidence grading, argument structure and emphasis, source sufficiency, and characterization of findings (the choice between “the author found,” “the author claimed,” “the author acknowledged” — verbs that carry different epistemic weight).
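Mechanically, each checkpoint is a hard stop: the model states what it intends to do and waits for approval before proceeding. A sketch of the idea, using hypothetical names rather than the prompt's actual wording:

```python
from enum import Enum, auto

class Checkpoint(Enum):
    SEARCH_STRATEGY_AND_SCOPE = auto()
    SOURCE_TRIAGE = auto()
    EVIDENCE_GRADING = auto()
    ARGUMENT_STRUCTURE = auto()
    SOURCE_SUFFICIENCY = auto()
    CHARACTERIZATION_OF_FINDINGS = auto()

def gate(checkpoint: Checkpoint, proposal: str) -> bool:
    """Show the proposed action and block until the human answers."""
    answer = input(f"[{checkpoint.name}] {proposal}. Approve? (y/n): ")
    return answer.strip().lower().startswith("y")

# Example: evidence may not be graded without explicit sign-off.
if gate(Checkpoint.EVIDENCE_GRADING,
        "Grade Stokke et al. (2023) as Primary / Direct for the detection-bias claim"):
    print("Grade recorded in the source inventory.")
else:
    print("Grading deferred; revisit with the user.")
```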
The system prompt
The complete methodology is captured in a Research Analysis Project system prompt designed for Claude Projects. It encodes:
A five-step workflow: Scope → Gather (full text first) → Report → User review → Executive summary. Each step has explicit gates — you don’t proceed until the previous step is complete.
A source classification system: Every citation requires two tags. By origin: Primary (original data/analysis) or Secondary (reporting someone else’s work). By relationship to the claim: Direct (demonstrates it), Analogical (supports by extension from a different context), or Contextual (background only — does not constitute evidence). A source can be “Primary / Analogical” — original data, but from a different context than the claim being evaluated.
A source inventory table: Running log of every source with what you actually have access to (full text, abstract only, search snippet), what it supports, its evidence grade, and its status.
Six decision checkpoints where the human must approve before proceeding.
Eight standing rules including: accuracy over agreement; no unattributed generalizations; trace claims to primary sources; when challenged, re-examine rather than defend.
Prohibited phrases — “It is well known that…,” “Research generally shows…,” “The literature suggests…,” “Studies have found…” — unless a specific citation is attached.
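To make the bookkeeping concrete, here is a rough sketch of one source inventory row plus the unattributed-generalization check. Again, the names are illustrative only; the actual project prompt expresses these requirements as prose rules, not code:

```python
from dataclasses import dataclass

PROHIBITED_OPENERS = (
    "it is well known that",
    "research generally shows",
    "the literature suggests",
    "studies have found",
)

@dataclass
class SourceRecord:
    """One row of the source inventory (illustrative field names)."""
    citation: str
    access: str        # "full text" | "abstract only" | "search snippet"
    origin: str        # "Primary" | "Secondary"
    relationship: str  # "Direct" | "Analogical" | "Contextual"
    supports: str      # the specific claim this source is used for
    status: str = "pending user review"

def violates_generalization_rule(sentence: str, has_citation: bool) -> bool:
    """True if the sentence opens with a prohibited phrase and carries no citation."""
    s = sentence.strip().lower()
    return (not has_citation) and any(s.startswith(p) for p in PROHIBITED_OPENERS)

stokke = SourceRecord(
    citation="Stokke et al. (2023), Scientific Reports",
    access="full text",
    origin="Primary",
    relationship="Direct",
    supports="Carcass detection bias in collision mortality estimates",
)
assert violates_generalization_rule("Research generally shows high avoidance.", has_citation=False)
assert not violates_generalization_rule("Stokke et al. (2023) measured detection bias.", has_citation=True)
```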
What this is and isn’t
This isn’t a general-purpose research guide. It’s a set of guardrails built from specific, documented failure modes encountered during a real research session. Every rule traces to something that actually went wrong.
“Do not present contextual sources as if they rebut or prove anything” exists because Claude did exactly that in Round 1.
“If the primary source is not accessible, say so explicitly” exists because Claude cited Schuster’s characterization of Grünkorn without checking the original.
“Never present uncertain locations as precise ones” exists because Claude provided “~line 4” references as if they were verified page numbers.
“If the literature search returns few or no critiques, report that as a finding” exists because, rather than reporting that the study is only three months old and hasn’t been formally critiqued yet, Claude compensated with contextual literature cited at a higher evidential weight than it warranted.
These rules won’t prevent all analytical failures. They didn’t arise from theory — they arose from catching specific mistakes during actual work. Like the operational discipline playbook in my earlier post, this is a QA system built around an instrument whose output can’t be blindly trusted.
The system prompt is available at: [REPO LINK]
The operational discipline playbook covers the broader project management failures. This prompt covers the research-specific ones. Together, they represent what I’ve had to build to use Claude for professional work I can stand behind.
The stress test: same model, same question
Rules are nice. Evidence is better. I ran v3 of the prompt against the same question — evaluate the VolZug study — using the same model (Claude Opus 4.6). The output is a full critical evaluation that can be compared directly against the Round 1 analysis that started this whole exercise.
Here’s the structural comparison:
| Round 1 | v3 output |
|---|---|
| No scoping — went straight to searching | Step 0 fired, scope agreed before search |
| Sparse evidence compensated with contextual literature cited as rebuttal | Sparse evidence reported as a finding: “No published study was identified… This is noted as a finding, not a gap to fill” |
| Raptor studies conflated with passerine collision risk | Raptors explicitly excluded from scope |
| Seabird displacement presented alongside collision evidence | Displacement excluded from scope |
| Communication tower mass mortality extrapolated to turbines without flagging gap | FINO1 offshore platform data used — directly relevant — structure-type conflation absent |
| Sources not classified | Every source classified by origin (Primary/Secondary) and relationship to claim (Direct/Analogical/Contextual) |
| Contextual literature dressed as rebuttal | Contextual sources labeled as Contextual (Chamberlain, Dutch program) |
| No separation of published evidence from the model’s own reasoning | Every section has published evidence and [Internal analysis] clearly separated |
| Single confident pass, no user input | Multiple pause points with explicit questions |
| No source inventory | Full inventory with access status and classification |
What the prompt fixed, specifically:
- Sparse evidence is reported as a finding (standing rule 9): “No direct published rebuttal exists yet” appears as a finding in the output, not something compensated for with inflated sources.
- Source classification is happening: Direct, Analogical, and Contextual grades are all assigned and visible in the source inventory.
- An exclude list exists: raptors, seabirds, and general reviews are correctly excluded, where they were conflated in Round 1.
- Access limitations are flagged: consultancy reports are marked as [PARTIAL], the source inventory has an access status column, and sources cited via other papers are tagged [PARTIAL — cited via VolZug].
- Internal analysis is separated from published evidence: every claim section has a labeled [Internal analysis] subsection distinct from the published evidence discussion.
A few sources remain [PARTIAL] (Klop 2020, Schulz 2014, Hüppop 2006). That’s the prompt working as intended — these are flagged, not silently used as if fully reviewed.
In my opinion, the prompt made a material difference. Same model, same question, dramatically different quality of reasoning and evidence handling.
The deeper point
The failure modes documented here aren’t unique to AI. Presenting contextual literature as direct evidence, citing secondary sources without checking primaries, conflating “related to the topic” with “supports the argument” — these are mistakes human researchers make too. Peer reviewers catch them. Supervisors catch them.
The difference is speed and presentation. Claude produced a fluent six-point critique in minutes. The conversational interface — as I wrote in my earlier post — triggers anthropomorphic projection. The analysis reads like it came from someone who evaluated the evidence carefully. It didn’t.
That’s the core problem: not new failure modes, but existing ones running faster with the friction stripped out.
This post was written collaboratively with Claude (Opus 4.6), drafted under faithful narration rules built to prevent the editorialization and perspective fabrication documented across five previous blog projects. The system prompt and stress test report are available at [REPO LINK]. If you’ve built similar research workflows around LLM limitations, I’d be interested to hear about it. Reach me at matteo@mycarta.ca or on the blog.