Confidence intervals and prediction intervals in OLS regression: a geoscience worked example

Featured

Posted on March 3, 2026 by matteomycarta

Introduction

I recently released an open source research bullshit detector. I ended up doing some house cleaning in the repo Data-science-tools-petroleum-exploration-and-production. The result is this new notebook, available in a teaching-oriented version and a practitioner-oriented version on GitHub, that walks through the distinction between the regression confidence interval (CI) and the prediction interval (PI), using a real petroleum geology dataset.

When you fit an OLS regression to well data and plot the result, the output typically includes an uncertainty band around the regression line. That band can answer two very different questions, depending on how it is computed. One question is: “Where does the average production lie, for wells with a given gross pay?” The other is: “What production should we expect from the next individual well we drill?” These are not the same question, and answering one when you meant to ask the other can lead to a very different conclusion in a drilling decision context.

The two intervals

The confidence interval (CI) captures uncertainty about where the true regression line lies. Because our sample is limited, the estimated line is just one of many possible lines we could have obtained. The CI narrows as sample size increases, and answers: “What is the average production for wells at this gross pay value?”

The prediction interval (PI) captures uncertainty about where a new individual observation will fall. Even if the true regression line were known exactly, individual wells would still scatter around it due to natural variability. The PI always includes that residual scatter on top of parameter uncertainty — so it is always wider than the CI, and retains an irreducible minimum width even with infinite data.

Mathematically, the only difference between the two formulas is a +1 under the square root in the PI expression. That extra 1 represents the variance of a single new observation around the mean — what the notebook calls the irreducible scatter.
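For the single-predictor case used here, the two half-widths can be written out as below. This is the standard textbook form for simple linear regression, not a copy of the notebook's own notation: the symbols (x_0 for the gross pay of interest, s for the residual standard error, n for the number of wells) are choices made for this sketch.

```latex
\begin{align}
\text{CI:}\quad & \hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s\,
  \sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}} \\
\text{PI:}\quad & \hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s\,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}
\end{align}
```

As n grows, the 1/n term and the distance term both shrink toward zero, so the CI half-width goes to zero while the PI half-width settles at roughly t times s. That floor is the irreducible scatter.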

In statsmodels, both intervals come out of a single call: results.get_prediction().summary_frame(alpha=0.05), with the CI in columns mean_ci_lower / mean_ci_upper and the PI in obs_ci_lower / obs_ci_upper.

The dataset

The data comes from Lee Hunt’s (2013) paper Many correlation coefficients, null hypotheses, and high value (CSEG Recorder, December 2013). It contains measurements from 21 wells producing from a marine barrier sand, with variables including gross pay (m), porosity-height, position within the reservoir, pressure draw-down, and production in tens of barrels per day. Gross pay is the strongest single predictor of production (r = 0.87), so that is the starting point.

Where the difference matters: economic risk

The practical value of the CI vs. PI distinction becomes concrete when an economic cutoff is added. In the notebook the minimum economic production is set at 20 (tens of bbl/d), and the question is: what minimum gross pay should be required before drilling?

Looking at the regression line alone, ~3.5 m of gross pay looks sufficient — the predicted mean production at that thickness crosses the threshold. But the PI lower bound tells a different story: to have 95% confidence that the next well drilled will exceed the economic cutoff, approximately 12 m of gross pay is needed. The difference between 3.5 m and 12 m is enormous in practical terms — it could determine whether a prospect gets drilled at all. The figure below shows this directly.
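The cutoff logic can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the data are synthetic (not Hunt's wells), the function names are hypothetical, and the Student-t critical value for 19 degrees of freedom is hard-coded rather than looked up. The point is only to show how the same cutoff gives two different minimum thicknesses depending on which curve you read.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns what the intervals need."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    return {"b0": b0, "b1": b1, "s": s, "n": n, "x_bar": x_bar, "sxx": sxx}

def lower_bounds(x0, fit, t_crit):
    """Return (predicted mean, CI lower bound, PI lower bound) at x0."""
    mean = fit["b0"] + fit["b1"] * x0
    leverage = 1 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["sxx"]
    ci_lower = mean - t_crit * fit["s"] * math.sqrt(leverage)
    pi_lower = mean - t_crit * fit["s"] * math.sqrt(1 + leverage)  # the extra +1
    return mean, ci_lower, pi_lower

def first_x_above(cutoff, column, fit, t_crit, grid):
    """Smallest grid value whose chosen curve (0=mean, 1=CI, 2=PI) reaches cutoff."""
    for x0 in grid:
        if lower_bounds(x0, fit, t_crit)[column] >= cutoff:
            return x0
    return None

# Synthetic data, NOT Hunt's wells: 21 points scattered around y = 10 + 3x.
x = [float(i) for i in range(1, 22)]
y = [10 + 3 * xi + (8 if i % 2 == 0 else -8) for i, xi in enumerate(x)]

fit = fit_line(x, y)
T_CRIT = 2.093  # two-sided 95% Student-t value for n - 2 = 19 degrees of freedom
grid = [i / 10 for i in range(0, 251)]

x_mean = first_x_above(20, 0, fit, T_CRIT, grid)
x_pi = first_x_above(20, 2, fit, T_CRIT, grid)
print(f"mean crosses the cutoff at x = {x_mean}, PI lower bound at x = {x_pi}")
```

With real data you would read the same two crossings off the mean column and the obs_ci_lower column of the statsmodels summary frame instead of recomputing them by hand.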

Effect of sample size

The analysis is repeated with only 5 wells, representing an early appraisal scenario. The PI widens substantially, and the required minimum gross pay shifts upward again. As Hunt (2013) notes: the path forward is to either accept the uncertainty or work to reduce it — drill more wells, incorporate additional seismic data, and so on.

Adding predictors

In practice, production depends on more than gross pay. Adding Position and Pressure to the model — two physically meaningful predictors — improves R² and reduces the residual standard error. A partial-effect plot (holding Position and Pressure at their mean values, varying Gross pay) shows the multivariate PI is visibly narrower than the bivariate one. The side-by-side comparison carries the title “Adding Predictors Narrows the Prediction Interval.”

Closing

The key point is stated directly in the notebook: when assessing risk for the next well, reach for the PI, not the CI. The regression line and the CI answer a different question than the one a drilling decision requires.

Picobot Revisited: Optimizing a Tiny Robot’s Rules, Ten Years Later

Posted on January 31, 2026 by matteomycarta

A decade-old folder, handwritten notes, and a deceptively simple robot.

Introduction

Wrapping up a third personal fun project in two months? Check!! And this is the longest-standing one, and possibly one of my favourites ever. It goes back to when I was barely past my first steps in exploring both Python and Computer Science. This project was fun because it had to do with solving puzzles. I am happy to share it with you, my readers, today.

If you’ve ever watched a Roomba bump into a wall, spin around, and trundle off in a seemingly random direction, you’ve witnessed a real-world version of the problem I’m about to describe. How does a robot that can only sense what’s immediately around it — no map, no memory of where it’s been, no grand plan — manage to cover every square inch of a room?

In January 2015, I was working through Harvey Mudd College’s “CS for All” materials on my own (no live instruction, no solutions to check against) and I encountered Picobot: a simulated robot even simpler than a Roomba. Picobot became one of my favourite puzzles. I scribbled diagrams, wrote copious amounts of notes, tested rules, and eventually optimized my solutions down to what I believed were the minimum number of rules needed to cover the whole room. I kept everything in a well-worn file folder. This was my very first serious dabble in CS, and I loved it!

That folder has survived multiple reorganizations over the years – every once in a while I’d open it, think about writing it up properly, and close it again. But, after the positive experience of wrapping up projects collaboratively with Claude (the colormap app, the Mill’s Methods post), Picobot was next in line.

With the help of Claude Opus (v 4.5) I verified those old solutions, built a Python simulator, and finally documented the work properly.

This post is about the optimization journey. The reasoning. The moments when things click.

What is Picobot?

Picobot is a pedagogical robot created for Harvey Mudd’s introductory computer science course. It lives in a grid world and has one job: visit every empty cell. The catch? Picobot is nearly blind.

The Constraints

Picobot can only sense its four immediate neighbours: North, East, West, and South. For each direction, it knows one thing: is there a wall, or is it empty? That’s it. No memory of where it’s been. No coordinates. No global view.

Here’s an example of what Picobot “sees”:

    N
  W ● E    ←  Picobot sees: N=empty, E=wall, W=empty, S=empty
    S

We encode this as a 4-character string: xExx

x means empty (nothing there)
N, E, W, or S means wall in that direction
Position order is always: North, East, West, South

So xExx means “wall to the East, everything else empty.”
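The encoding is easy to express in code. This is a minimal sketch with hypothetical helper names (encode, decode), not part of the Harvey Mudd simulator: it just turns a set of walled directions into the 4-character string and back.

```python
ORDER = "NEWS"  # position order: North, East, West, South

def encode(walls):
    """Turn a set of walled directions, e.g. {"E"}, into a string like "xExx"."""
    return "".join(d if d in walls else "x" for d in ORDER)

def decode(code):
    """Turn a string like "NxWx" back into the set of walled directions."""
    return {d for d, ch in zip(ORDER, code) if ch != "x"}

print(encode({"E"}))   # wall to the East only
print(decode("NxWx"))  # walls to the North and West
```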

The Rules

Picobot follows rules that say: “If I’m in this state and I see this pattern, then move this direction and switch to this state.”

The format is:

STATE  SURROUNDINGS -> MOVE  NEW_STATE

For example:

0 Nx** -> E 1

This means: “In State 0, if there’s a wall to the North and East is empty, move East and switch to State 1.”

The wildcard * matches anything:

0 x*** -> N 0

“In State 0, if North is empty (don’t care about the rest), move North and stay in State 0.”

There’s also a special move: X (stay put). The robot doesn’t move but can change state. This seems useless at first. It’s not.
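To make the rule format concrete, here is a small sketch of how a rule line could be parsed and matched against what Picobot sees. The names Rule, parse_rule and matches are hypothetical and chosen for this example; the real simulator may represent rules differently.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    state: int
    pattern: str    # 4 characters in N, E, W, S order; "*" is a wildcard
    move: str       # one of N, E, W, S, or X (stay put)
    new_state: int

def parse_rule(line: str) -> Rule:
    """Parse a line such as '0 Nx** -> E 1'."""
    left, right = line.split("->")
    state, pattern = left.split()
    move, new_state = right.split()
    return Rule(int(state), pattern, move, int(new_state))

def matches(pattern: str, surroundings: str) -> bool:
    """True if every non-wildcard character agrees with the surroundings."""
    return all(p == "*" or p == s for p, s in zip(pattern, surroundings))

rule = parse_rule("0 Nx** -> E 1")
print(rule)
print(matches(rule.pattern, "NxWx"))  # wall North, East empty: the rule applies
print(matches(rule.pattern, "NExx"))  # East is a wall: the rule does not apply
```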

The Goal

Write the smallest set of rules that makes Picobot visit every empty cell in a room, regardless of where it starts.

The Harvey Mudd Picobot lab posed two main challenges, below, and several optional ones.

Empty Room: A rectangular room with walls only on the boundary
Maze: A maze with single-cell-wide corridors

The lab simulator is actually still live at https://www.cs.hmc.edu/picobot/

Give it a shot, it’s fun!

Back to the story.

The Empty Room: From 7 to 6 Rules

The Strategy: Boustrophedon

The word comes from Greek: “ox-turning.” It’s how you plow a field: go one direction, turn around at the end, come back the other way. It’s also how you mow a lawn, or write a line of text, then the next (if you are Etruscan).

For Picobot, the boustrophedon pattern looks like this:

The robot sweeps East, drops down, sweeps West, drops down, repeats. But first, it needs to get to the top of the room — so it goes North until it hits the wall.

My Initial Solution: January 6, 2015

I have an email I sent to myself at 12:44 AM on January 6, 2015 — working late (on a Tuesday night!!!) on this puzzle. It shows my first experiments:

First experiment: go to origin:
# go to origin
0 **** -> X 3
3 ***x -> S 3
3 ***S -> W 2 
2 **x* -> W 2
2 **W* -> X 0

And then my first complete solution:

Final solution program 1
0 x*** -> N 0     # (initial) state 0 with nothing N: go N
0 Nx** -> E 1     # state 0 with a wall N but none E: go E, AND

1 *x** -> E 1     # state 1 with nothing E: go E
# OR, instead of previous 2. This is if initially by E wall
0 NE** -> W 2     # state 0 with a wall N and one E: go W

# once it reaches east wall
1 *E** -> W 2     # state 1 with a wall E: go W
2 **x* -> W 2     # state 2 with nothing W: go W
2 **W* -> S 1     # state 2 with a wall W: go S

That’s 7 rules. The comments show my thinking — I was handling the case where Picobot starts by the East wall separately.

The Harvey Mudd lecture slides posed an extra challenge: “how FEW rules can you use? The current record is six rules”. The solution wasn’t shown, just the target. That became the question that hooked me: how do you get there? I was one rule away.

The Insight: “C and F Are the Same”

My handwritten notes show positions labelled A through F, representing different situations Picobot might encounter. The breakthrough came when I realized:

Position C (just finished going North, need to decide: East or West?) and Position F (at a wall during the sweep, need to decide direction) were being handled by separate rules — but they didn’t need to be.

The key insight: after going North and hitting the wall, I don’t need a separate rule to check East. I can use the X move (stay put) to transition to State 1, and let State 1’s existing rules handle it.

This is counter-intuitive. The X move looks like wasted time — the robot just sits there! But it’s not wasted. It’s a state transition without movement that lets me reuse existing rules instead of duplicating logic.

The Final Solution: January 24, 2015

Eighteen days later, I emailed myself the optimized solution — Saturday, January 24, 2015 at 5:05 PM (weekend fun work):

# Optimized EMPTY ROOM program:
0 x*** -> N 0
0 N*** -> X 1
1 *x** -> E 1
1 *E** -> W 2
2 **x* -> W 2
2 **W* -> S 1

Six rules. Let me walk through why this works:

State 0 handles “going North.” When Picobot hits the North wall, it executes X 1 — stays put but switches to State 1. Now State 1 takes over.

State 1 is dual-purpose:

If East is empty → go East (continuing the sweep)
If East is wall → start going West (end of row)

Because Picobot stays put when transitioning from State 0 to State 1, it’s in the exact same position, and State 1 correctly determines whether to go East or start heading West.

State 2 sweeps West. When it hits the West wall, it goes South and switches back to State 1. Again, State 1 determines: East or end of row?

The elegance is that State 1 does double duty. It handles both “continue going East” and “decide what to do at the end of a row.” The X move is what makes this possible.

Verified

I tested this against all 529 possible starting positions in a 25×25 room. Every single one reaches 100% coverage. Maximum steps: 1,013. The solution works.
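A check along these lines can be sketched in well under a hundred lines. This is not the simulator from the repository, just a minimal stand-in under stated assumptions: the "25×25 room" is taken to include its boundary walls, leaving a 23×23 interior (529 cells); Picobot always starts in State 0; and a move into a wall counts as a failure. The function names are hypothetical.

```python
# The 6-rule program as (state, pattern, move, new_state) tuples.
RULES = [
    (0, "x***", "N", 0),
    (0, "N***", "X", 1),
    (1, "*x**", "E", 1),
    (1, "*E**", "W", 2),
    (2, "**x*", "W", 2),
    (2, "**W*", "S", 1),
]

STEP = {"N": (-1, 0), "E": (0, 1), "W": (0, -1), "S": (1, 0), "X": (0, 0)}

def surroundings(pos, size):
    """NEWS string for a cell in an empty room with a size x size interior."""
    r, c = pos
    walls = {"N": r == 0, "E": c == size - 1, "W": c == 0, "S": r == size - 1}
    return "".join(d if walls[d] else "x" for d in "NEWS")

def matches(pattern, seen):
    return all(p == "*" or p == s for p, s in zip(pattern, seen))

def covers_room(rules, start, size, max_steps=10_000):
    """Run the rules from `start` in State 0; True once every cell is visited."""
    pos, state = start, 0
    visited = {pos}
    for _ in range(max_steps):
        if len(visited) == size * size:
            return True
        seen = surroundings(pos, size)
        rule = next((r for r in rules if r[0] == state and matches(r[1], seen)), None)
        if rule is None or rule[2] in seen:  # no rule applies, or the move hits a wall
            return False
        _, _, move, state = rule
        dr, dc = STEP[move]
        pos = (pos[0] + dr, pos[1] + dc)
        visited.add(pos)
    return len(visited) == size * size

SIZE = 23  # interior of a 25x25 room with boundary walls: 23 * 23 = 529 cells
starts = [(r, c) for r in range(SIZE) for c in range(SIZE)]
print(all(covers_room(RULES, s, SIZE) for s in starts))
```

Dropping the X rule from RULES is a quick way to see why it matters: the robot reaches the North wall in State 0 and has no rule to apply.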

The Maze: From 16 to 12 Rules

The maze challenge is different. Corridors are one cell wide. There are dead ends, branches, and loops. The boustrophedon strategy won’t work here.

The Strategy: Right-Hand Wall Following

The classic maze-solving algorithm: keep your right hand on the wall and walk. You’ll eventually visit everywhere (in a simply-connected maze).

For Picobot, “right hand on wall” translates to:

If you can turn right, turn right
Otherwise, if you can go forward, go forward
Otherwise, if you can turn left, turn left
Otherwise, turn around (dead end)

With four directions (North, East, West, South) and the “right-hand” rule relative to each, we need four states — one for each direction Picobot is “facing.”

State 0: Going North (right hand on East wall)

State 1: Going East (right hand on South wall)

State 2: Going West (right hand on North wall)

State 3: Going South (right hand on West wall)

Initial Solution: 16 Rules

The straightforward implementation uses 4 rules per state:

# State 0: Facing North (right hand = East)
0 *x** -> E 1      # Can turn right → turn right (now facing East)
0 xE** -> N 0      # Can't turn right, but forward is open → go North
0 NEx* -> W 2      # Can't go right or forward → turn left (now facing West)
0 NEW* -> S 3      # Dead end → turn around (now facing South)

# ... and similarly for States 1, 2, 3

16 rules total. It works. But can we do better?

Two-Phase Optimization

My maze notes show two distinct approaches:

Phase 1: Working from principles. The small diagram in my notes shows me reasoning about the state transitions theoretically. What’s the minimum information needed at each decision point? Where is there redundancy?

Phase 2: Empirical debugging. The large diagram shows positions A through K — specific spots in a maze where I tested rules. When the principled approach hit edge cases, I sketched the situation, walked through it (“what would I do here?”), and translated my intuition into rules.

The note “Key is G” appears on the page. Position G was where the solution got validated — when it handled G correctly, the logic was proven.

The Iteration: A Failed Attempt

That same January 24 email shows me trying to adapt the empty room optimization for the maze — and failing:

This, optimized for maze, does not work. At dead ends it turns around but then it goes to the other end and enters an infinite loop...

The attempt that followed didn’t handle dead ends properly. The robot would turn around, walk to the other end, and loop forever.

The Final Solution

Then, in the same email:

This works!!
0 *x** -> E 1
0 xE** -> N 0
0 NE** -> X 2
1 ***x -> S 3
1 *x*S -> E 1
1 *E*S -> X 0
2 x*** -> N 0
2 N*x* -> W 2
2 N*W* -> X 3
3 **x* -> W 2
3 **Wx -> S 3
3 **WS -> X 1

12 rules: 3 per state instead of 4. A 25% reduction.

The key insight: each state now handles only three cases:

Right is open → turn right
Forward is open → go forward
Both blocked → stay put, rotate to next state (which will check left/behind)

The X move chains states together. If right and forward are blocked, we stay put and try the next state. That state checks its right (our left). If that’s blocked too, it chains again. The sequence continues until we find a way forward.
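One way to see why three rules per state are enough is to generate them from a single template: right open, forward open, otherwise stay put and rotate left. The sketch below does that and compares the result with the 12 rules from the email. The table FACING and the helper names are hypothetical, introduced only for this illustration.

```python
ORDER = "NEWS"  # position order in the surroundings string

# facing direction -> (state number, right-hand direction, left-hand direction)
FACING = {
    "N": (0, "E", "W"),
    "E": (1, "S", "N"),
    "W": (2, "N", "S"),
    "S": (3, "W", "E"),
}

def pattern(constraints):
    """Build a 4-char pattern from {direction: 'open' | 'wall'}; the rest is '*'."""
    out = []
    for d in ORDER:
        c = constraints.get(d)
        out.append("*" if c is None else ("x" if c == "open" else d))
    return "".join(out)

def rules_for(facing):
    state, right, left = FACING[facing]
    return [
        # 1. Right is open: turn right and move that way.
        f"{state} {pattern({right: 'open'})} -> {right} {FACING[right][0]}",
        # 2. Right is blocked, forward is open: keep going.
        f"{state} {pattern({right: 'wall', facing: 'open'})} -> {facing} {state}",
        # 3. Both blocked: stay put and hand over to the state facing left.
        f"{state} {pattern({right: 'wall', facing: 'wall'})} -> X {FACING[left][0]}",
    ]

generated = [rule for facing in FACING for rule in rules_for(facing)]

EMAIL_RULES = """
0 *x** -> E 1
0 xE** -> N 0
0 NE** -> X 2
1 ***x -> S 3
1 *x*S -> E 1
1 *E*S -> X 0
2 x*** -> N 0
2 N*x* -> W 2
2 N*W* -> X 3
3 **x* -> W 2
3 **Wx -> S 3
3 **WS -> X 1
""".strip().splitlines()

print(generated == EMAIL_RULES)
```

The third rule in each state never checks left or behind. It only passes control to the neighbouring state, whose first rule then checks what used to be "left".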

Verified

Tested against all 287 reachable positions in a 25×25 maze, and all 280 cells in the actual Harvey Mudd lab maze. 100% coverage every time. Here’s one simulation:

The right-hand rule doesn’t just guarantee coverage — it collapses the state space. The rules are ordered to check “right side open” first. In State 0 (facing North), rule 1 asks: is East open? If yes, go East — Picobot never evaluates what’s ahead. That’s how rule ordering implements “keep your hand on the wall.” Different physical positions with the same wall-relationship become equivalent, and that’s what makes 4 states and 12 rules possible. Take a look at the simulations below of the two equivalent positions sketched in my handwritten notes, shown earlier:

Making It Explicit: Starting State Matters

Here’s something worth highlighting — something that’s in the Harvey Mudd lab instructions but easy to overlook.

The 6-rule empty room solution requires Picobot to start in State 0.

The Harvey Mudd simulator always starts in State 0, and the lab materials mention this. Whether I consciously accounted for this in 2015, I don’t remember — I didn’t document it in my notes. But when I built my own simulator in 2025, I could test explicitly: what happens if Picobot starts in State 1 or State 2?

Start State	Initial Direction	Coverage
0	North	100% ✓
1	East	~50% ✗
2	West	~45% ✗

Starting in State 1 or 2, Picobot gets stuck. It begins the East-West sweep from wherever it starts — never going North to reach the top first. The rows above its starting position never get visited.

This isn’t a bug in the solution. It’s a constraint: the boustrophedon pattern assumes you start by going North. The 6-rule minimum only works because State 0 guarantees that first trip to the top wall.

A truly state-agnostic solution — one that works regardless of starting state — would need more rules. The elegance of 6 rules comes from working within the standard initial conditions.

What I Learned

The X move is not wasted time. It’s a state transition that enables rule reuse. This is the key to minimizing rule count.
Different problems, different methods. The empty room yielded to analytical insight (“C and F are the same”). The maze required two phases: principled derivation, then empirical debugging.
Implicit assumptions matter. The starting state requirement was in the lab materials all along, but easy to overlook. Building my own tools made it explicit.
Old projects are worth revisiting. With fresh eyes — and some help — I found new ways to understand and share work I already knew.
How I approached it. Looking back at my notes, I see a pattern that’s familiar from my day-to-day work: diagrams everywhere, positions A-K labeled, “me walking in the maze.” Try something → watch where it fails → sketch that spot → ask “what would I do here?” → translate to rules → repeat. “C and F are the same” collapsed the problem by seeing equivalence the formal notation obscured. The notes weren’t just records — they were how I thought. And 18 days between 7 rules and 6 rules: no rushing, no giving up. This is field scientist methodology applied to computer science. Maybe that’s why I loved it.
There is no free lunch in AI collaboration. This project — both the technical verification and this blog post — would not have been possible without deep understanding of the subject matter. That understanding came from me (the 2015 work, the insights, the diagrams), from the extensive documentation I’d kept, and from all the iterative work we did together. This isn’t “vanilla coding” where you prompt an AI and get a finished product. It’s genuine collaboration: human insight plus AI execution. The AI didn’t optimize Picobot — I did, in 2015. The AI helped me verify, document, and communicate that work in 2025.

Try It Yourself

The full Python implementation is on GitHub: https://github.com/mycarta/picobot-optimizer

It includes:

picobot_simulator.py — The core engine
picobot_rooms.py — Empty room and maze generators
picobot_visualizer.py — GIF animation creator
optimized_solutions.py — The 6-rule and 12-rule solutions
test_solutions.py — Exhaustive verification

All documented and ready to explore.

What’s Next

Part 2: How I revisited this project with AI assistance — and what that collaboration actually looked like.

Part 3: Educational materials. Exercises, concept checks, and scaffolded challenges for those learning to code.

The Picobot simulator was created for Harvey Mudd College’s “CS for All” course. My optimization work is from January 2015. Verification, documentation, and visualization were completed in January 2025 with AI assistance.

AI/HI (Human Intelligence) Transparency Statement

Modified from Brewin

Has any text been generated using HI?	Yes
Has any text been generated using AI?	Yes
Has any text been improved or corrected using HI?	Yes
Have any methods of analysis been suggested using HI?	Yes
Have any methods of analysis been suggested using AI?	Yes
Do any analyses utilize AI technologies, such as Large Language Models, for tasks like analyzing, summarizing, or retrieving information from data?	Yes

Additional context:

The Picobot optimization work described in this post — the solutions, the insights, the handwritten diagrams, the reasoning behind “C and F are the same” and “Key is G” — was done entirely by me in January 2015, working alone through Harvey Mudd’s CS for All materials with no live instruction and no solutions to check against. The emails quoted in this post are timestamped records from that work.

In January 2025, I revisited this project with Claude AI (Anthropic). Claude built the Python simulator, ran exhaustive verification tests, created the GIF visualizations, and helped document the reasoning. The explicit testing of starting states emerged from our joint exploration — I asked the question, Claude ran the tests.

This post was drafted collaboratively. I provided the source materials (my 2015 notes, emails, the verified solutions, our session transcripts), direction, and editorial judgment throughout. Claude drafted based on these inputs and our discussion of structure and framing. I reviewed, revised, and made all final decisions about what went to publication.

A note on AI collaboration: This kind of work is not “vanilla coding” — prompting an AI and receiving a polished output. It required deep domain knowledge (mine), extensive primary documentation (my 2015 notes and emails), iterative correction (many rounds), and genuine intellectual engagement from both sides. The AI contributed too — not the original insights, but meta-insights: recognizing patterns in my notes, naming things I’d done but hadn’t articulated (like “C and F are the same” as a key moment), and seeing that I’d used different methodologies for the empty room versus the maze. The AI did not and could not have done this alone. Neither could I have done the verification, visualization, and documentation at this scale without AI assistance. That’s what real collaboration looks like.

The intellectual work is mine. The documentation, verification, and articulation is collaborative.

Modernizing Python Code in the AI Era: A Different Kind of Learning

Posted on January 24, 2026 by matteomycarta

A few years ago I wrote about advancing my Python coding skills after working through a couple of chapters from Daniel Chen’s excellent book Pandas for Everyone. In that post I showed how I improved code I’d written in 2018 for the SEG Machine Learning contest. The original code used unique() to get lists of well names, then looped through with list comprehensions to calculate flagged samples and proportions. The 2020 version replaced all that with groupby() and apply(), making it much more compact and Pythonic. For example, where I’d written a list comprehension like [result_a.loc[result_a.zone==z,'flag'].sum() for z in zones_a], I could now write simply result_a.groupby('zone', sort=False).flag.sum().values. The runtime also improved – from 86ms down to 52ms. I remember being quite happy with how much cleaner and more readable the code turned out, and how the learning from those two chapters made an immediate practical difference.
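The two styles can be compared side by side on a toy table. The DataFrame below is made up for illustration (it is not the contest data), and it assumes pandas is installed; the column names zone and flag follow the snippets quoted above.

```python
import pandas as pd

# Made-up stand-in for the contest data: one row per sample.
result_a = pd.DataFrame({
    "zone": ["B", "B", "A", "C", "A", "C", "C"],
    "flag": [1, 0, 1, 1, 1, 0, 1],
})

# 2018 style: collect the unique zones, then filter and sum once per zone.
zones_a = result_a["zone"].unique()
flagged_loop = [result_a.loc[result_a.zone == z, "flag"].sum() for z in zones_a]

# 2020 style: one groupby; sort=False keeps zones in order of first appearance,
# which matches the order returned by unique().
flagged_groupby = result_a.groupby("zone", sort=False).flag.sum().values

print(flagged_loop)
print(list(flagged_groupby))
```

The sort=False argument is what keeps the two results aligned. Without it, groupby would return the zones in sorted order rather than in the order they first appear.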

Recently, I had to modernize the Busting bad colormaps Panel app, which I built back in 2020 to demonstrate colormap distortion artifacts (something that – as you know – I care a lot about). The app had been deliberately frozen in time – I’d pinned specific library versions in the environment file because I knew things would eventually become obsolete, and I wanted it to stay functional for as long as possible without having to constantly fix compatibility issues.

But some of those issues had finally caught up with me, and the app had been down for some time. Last fall, working with GitHub Copilot, I fixed some matplotlib 3.7+ compatibility problems – replacing the deprecated cm.register_cmap() with plt.colormaps.register(), fixing an rgb2gray error, and resolving a ValueError in the plotting functions.

But the deployment was also broken. In 2021, mybinder.org had switched to JupyterLab as the default interface, changing how apps needed to be deployed. Panel developers had to adapt their code to work with this new setup. The old Panel server URL pattern no longer worked. I tried to figure out the new URL pattern by browsing through the Binder documentation, but I couldn’t make sense of it and failed miserably. It was a short-lived effort that pushed me toward trying something different: full-on coding with Claude Opus 4.5 using Copilot in VSCode.

That’s what allowed me, this month, to complete the modernization process (though honestly, we still haven’t fully sorted out a Binder timeout issue).

A step back to 2020: Building the app from scratch

When I originally built the colormap app, I coded everything myself, experimenting with Panel features I’d never used before, figuring out the supporting functions and visualizations. I also got very good advice from the Panel Discourse channel when I got stuck.

One issue I worked on was getting the colormap collection switching to behave properly. After the first collection switch, the Colormaps dropdown would update correctly, but the Collections dropdown would become non-responsive. With help from experts on the Discourse channel, I figured out how to fix it using Panel’s param.Parameterized class structure.

2026: Working with Claude

The second, and hardest, part of the modernization was done almost entirely by Claude Opus. Here’s what that looked like in practice:

Binder deployment: Claude independently figured out the new JupyterLab URL pattern (?urlpath=lab/tree/NotebookName.ipynb instead of the old ?urlpath=%2Fpanel%2FNotebookName). Only later, when fact-checking for this post, did we discover the history of Binder’s 2021 switch to JupyterLab and how Panel had to adapt. This helped, though we’re still working through some timeout issues.

Environment upgrade: Claude upgraded to Python 3.12 and Panel 1.8.5, bringing everything up to modern versions. The key packages are now Panel 1.8.5, param 2.3.1, and bokeh 3.8.1.

Code modernization: Claude spotted and fixed deprecated API calls – the style parameter for Panel widgets became styles.

Collection switching – Claude’s breakthrough: This was Claude’s biggest solo contribution. The collection switching broke during the update, and Claude independently diagnosed that the class-based param.Parameterized approach that had worked in Panel 0.x wasn’t reliable in Panel 1.x. Without me having to guide the solution, Claude figured out how to rewrite it using explicit widgets with param.watch callbacks.

The comparison shows the change:

The new approach uses explicit widget objects with callback functions, which works more reliably in Panel 1.x than the class-based parameterized approach.

New features: Claude integrated two new colormap collections I’d been wanting to add for years – Fabio Crameri’s scientific colormaps (cmcrameri) and Kristen Thyng’s cmocean colormaps. That brought the total from 3 to 5 colormap collections.

Here are examples of the app showing each of the new collections:

The app testing the cmocean deep colormap

The app testing Crameri’s batlow colormap

Documentation: Claude updated the README with detailed step-by-step Binder instructions, added a troubleshooting section, and created a table documenting all five colormap collections.

I provided the requirements and guidance throughout, but I almost never looked at the implementation details – what I’ve taken to calling the “bits and bobs” of the code. I focused on what I needed to happen, Claude figured out how to make it happen.

What changed (and what didn’t)

I still understand what the code does conceptually. I can read it, review it, check that it’s correct. I know why we needed to move from Parameterized classes to explicit widgets, and I understand the reactive programming model. But I didn’t write those lines myself.

The work happens at a different level now. I bring the domain expertise (what makes a good colormap visualization), the requirements (needs to deploy on Binder, needs these specific colormap collections), and the quality judgment (that widget behavior isn’t quite right). Claude brings the implementation knowledge, awareness of modern best practices, and the ability to quickly adapt code patterns to new frameworks.

This is really different from my 2020 experience. Back then, working through those Pandas patterns taught me techniques I could apply to other projects. Now, I’m learning what becomes possible when you can clearly articulate requirements and delegate the implementation.

The honest trade-off

There’s a trade-off here, and I’m trying to be honest about it. In 2020, working through the Panel widget patterns taught me things that stuck. In 2026, I got working, modernized code in a fraction of the time, but with less hands-on knowledge of Panel 1.x internals.

For this particular project, that trade-off made sense. I needed a working app deployed and accessible, not deep expertise in Panel migration patterns. But I’m conscious that I’m optimizing for different outcomes now: shipping features fast versus building deep technical understanding through hands-on work.

What this means going forward

After years of writing code line by line, this new way of working feels both efficient and different. I got more done in a couple of hours than I might have accomplished in several weeks working solo. The app is modernized, deployed, working better than ever, and even has new features I’d been wanting to add for years.

This has been a gamechanger for how I work. I still do the work that matters most to me: seeing the tool gap, coming up with the vision, iteratively prototyping to flesh out what I actually need. That’s substantial work, and it’s mine. But after that initial phase? A lot of the implementation will be done with Claude. The app is done and it’s great, and I know this is the path forward for me.

References

Chen, D.Y. (2018). Pandas for Everyone: Python Data Analysis. Addison-Wesley Professional.

Crameri, F. (2018). Geodynamic diagnostics, scientific visualisation and StagLab 3.0. Geoscientific Model Development, 11, 2541-2562. https://www.fabiocrameri.ch/colourmaps/

Niccoli, M. (2020). Keep advancing your Python coding skills. MyCarta Blog. https://mycartablog.com/2020/10/22/keep-advancing-your-python-coding-skills/

Thyng, K.M., Greene, C.A., Hetland, R.D., Zimmerle, H.M., and DiMarco, S.F. (2016). True colors of oceanography: Guidelines for effective and accurate colormap selection. Oceanography, 29(3), 9-13. https://matplotlib.org/cmocean/

Try the app yourself: The modernized colormap distortion app is available on GitHub and you can run it in Binder without installing anything.

The value of intellectual play: Mill, machine learning, and a drilling problem I couldn’t stop thinking about

Posted on January 20, 2026 by matteomycarta

A few years back, I watched a CSEG talk by Lee Hunt (then at Jupiter Resources) called Value thinking: from the classical to the hyper-modern. One case study in particular stuck with me—so much so that I ended up exploring it in a Jupyter Lab notebook, bringing it up in a job interview, and eventually testing whether an AI could reason through it on its own.

This post is about that journey. It’s also about what happens when you let yourself get genuinely curious about someone else’s problem. And—fair warning—it involves a 19th-century philosopher, a seven-well dataset, and a neural network that learned to distrust AVO attributes.

The problem

Jupiter Resources had a history of occasionally encountering drilling trouble in the Wilrich reservoir—specifically, loss of circulation when encountering large systems of open fractures. Mud loss. The kind of problem that can cost you a well.

They had done extensive geophysical work with multiple seismic attributes that, in theory, should correlate with fractures: Curvature, Coherence, AVAz (amplitude variation with azimuth), VVAZ (velocity variation with azimuth), and Diffraction imaging. But they lacked direct calibration data for the drilling problem, and some of the attributes were giving conflicting results.

Lee Hunt, who led the team and the geophysical work, suspected from the start that the AVO-based attributes might be compromised. He had seen evidence as far back as 2014 that AVAz and VVAZ responses in the Wilrich were dominated by an overlying coal, not the fractures themselves—the attributes were measuring a different geological signal entirely. Diffraction imaging was planned early as a complementary measure, precisely because it might not be affected by the coals in the same way (personal communication).

Seven wells. Five attributes. Four of the wells had experienced drilling problems; three had not. Here’s the data:

The question: which attribute—or combination—could reliably predict drilling problems, so that future wells could be flagged ahead of time?

Mill’s Methods: 19th-century philosophy meets drilling risk

Rather than accept uncertainty and provide no geophysical guidance at all, the team at Jupiter tried something different: Mill’s Methods of Induction. Their goal was to find a pattern that could help them advise the operations team—flag high-risk well locations ahead of time so contingency plans could be in place. Mill’s Methods are a set of logical procedures for identifying causal relationships, laid out by philosopher John Stuart Mill in 1843. They’re often illustrated with a food poisoning example (who ate what, who got sick), but they work just as well here.

This approach was characteristic of Lee Hunt’s attitude toward quantitative geophysics—an attitude I had come to admire through his other work. A few years earlier, he had published a CSEG Recorder column called “Many correlation coefficients, null hypotheses, and high value,” a tutorial on statistics for geophysicists that included synthetic production data and an explicit invitation: “You can do it, too. Write in to tell us how.”

I took him up on it. I worked through his examples in Jupyter notebooks, built visualizations, explored prediction intervals, learned a good deal of scientific computing along the way. I reached out to him about the work. I even wrote up some of that exploration in a blog post on distance correlation and variable clustering—the kind of technical deep-dive where you’re learning as much about the tools as about the data. That extended engagement gave me a feel for his way of thinking: understand the statistics, accept the uncertainty, improve your techniques if you can—but don’t just throw up your hands when the data is messy.

Method of Agreement: Look at all the problem wells (A, B, F, G). What do they have in common? Curvature is TRUE for all four. So is Diffraction imaging. The other attributes vary.

Method of Difference: Compare problem wells to non-problem wells (C, D, E). Neither Curvature nor Diffraction alone perfectly discriminates—Well E has Curvature TRUE but no problem; Well D has Diffraction TRUE but no problem.

Joint Method: But here’s the key insight—Curvature AND Diffraction together form a perfect discriminator. Every well where both are TRUE had problems. Every well where at least one is FALSE did not.
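The three methods above can be sketched in a few lines of plain Python. This is only an illustration, not the team's workflow: it uses just the two attributes and the six wells whose values can be read from the text (Well D's Curvature and Well E's Diffraction are inferred as FALSE from the joint rule), and it leaves out Well C, for which the text only says there was no drilling problem. The names `wells` and `discriminates` are made up for the example.

```python
# Attribute values for the six wells that the text spells out or implies.
# Well C is omitted: the text only states that it had no drilling problem.
wells = {
    "A": {"Curvature": True,  "Diffraction": True,  "problem": True},
    "B": {"Curvature": True,  "Diffraction": True,  "problem": True},
    "D": {"Curvature": False, "Diffraction": True,  "problem": False},
    "E": {"Curvature": True,  "Diffraction": False, "problem": False},
    "F": {"Curvature": True,  "Diffraction": True,  "problem": True},
    "G": {"Curvature": True,  "Diffraction": True,  "problem": True},
}
attributes = ["Curvature", "Diffraction"]


def discriminates(rule):
    """True if rule(well) matches the observed outcome for every well."""
    return all(rule(w) == w["problem"] for w in wells.values())


# Method of Agreement: attributes that are TRUE in every problem well
agreement = [a for a in attributes
             if all(w[a] for w in wells.values() if w["problem"])]

# Method of Difference: does a single attribute separate the two groups?
single = {a: discriminates(lambda w, a=a: w[a]) for a in attributes}

# Joint Method: the conjunction of the two attributes
joint = discriminates(lambda w: w["Curvature"] and w["Diffraction"])

print(agreement)  # ['Curvature', 'Diffraction']
print(single)     # {'Curvature': False, 'Diffraction': False}
print(joint)      # True
```

Writing the rule as a function makes it easy to test any other candidate combination against the same wells.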

This wasn’t a claim about causation. It was a decision rule: when the next well location shows both high curvature and diffraction anomalies, flag it as elevated risk and ensure contingency protocols are in place.

The logic is sound because of asymmetric costs. Preparing for mud loss (having lost circulation material on site, adjusting mud weight plans) is a minor expense. Not preparing when you should have—that’s where you lose time, money, sometimes the well. You don’t need certainty to justify preparation. You need a defensible signal.

What a neural network learned

I wanted to see if a data-driven approach would arrive at the same answer. Looking at the table myself, and spending some time applying Mill’s Methods, I had already seen the pattern—Curvature and Diffraction together were the key predictors. But I was curious: what would a simple neural network learn on its own?

I trained a two-layer network (no hidden layer)—mathematically equivalent to logistic regression—on the same seven wells. (Yes, seven wells. I know. But stay with me.)
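To make the equivalence concrete, here is a minimal sketch of such a network in plain Python: inputs wired directly to one sigmoid output, trained by gradient descent on the log-loss, which is logistic regression. It is not the code from my notebook. It assumes only the two attributes and six wells whose values can be read from the text above (Well C is left out), so the weights it learns are not comparable to the five-attribute weights reported in this post; the learning rate and number of iterations are arbitrary choices.

```python
import math

# (Curvature, Diffraction) for wells A, B, D, E, F, G, as read from the text
X = [(1, 1), (1, 1), (0, 1), (1, 0), (1, 1), (1, 1)]
y = [1, 1, 0, 0, 1, 1]          # 1 = drilling problem, 0 = no problem


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


w = [0.0, 0.0]   # one weight per input attribute
b = 0.0          # bias of the single output unit
lr = 0.5         # learning rate (arbitrary)
n = len(X)

for _ in range(5000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for (x1, x2), target in zip(X, y):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)   # forward pass
        err = p - target                         # d(log-loss)/dz
        grad_w[0] += err * x1
        grad_w[1] += err * x2
        grad_b += err
    # Gradient descent step on the mean log-loss
    w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
    b -= lr * grad_b / n

predicted = [int(sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5) for x1, x2 in X]
print(predicted == y)
```

Because these wells are linearly separable, the weights keep growing slowly the longer the loop runs; the sign and relative size of the weights carry the information, not their absolute values.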

The network classified all seven wells correctly. But the real insight came from the weights it learned:

Attribute	Weight
Curvature	+14.6
Diffraction	+9.7
Coherence	~0
AVAz	−4.9
VVAZ	−14.5

Curvature and Diffraction were strongly positive—predictive of problems. Coherence contributed almost nothing. But AVAz and VVAZ were negative—the network learned to suppress them.

A way to think about negative weights: imagine training a network to identify ducks from a set of photos that includes birds, ducks, and people in duck suits. The network will learn to weight “duck features” positively, but also to weight “human features” negatively—to avoid being fooled by the costumes. In the Wilrich case, the AVAz and VVAZ attributes were like duck suits: they looked like fracture indicators, but they were actually measuring something else.

This was interesting. All five attributes have theoretical justification for detecting fractures. Why would the network actively discount two of them?

When I mentioned this result to Lee Hunt, he confirmed what he had long suspected (personal communication): the AVAz and VVAZ responses in the Wilrich were dominated by an overlying coal, not the fractures themselves. He had measured this effect and documented it in a 2014 paper, where multiple attributes—including AVAz—showed statistically significant correlations to coal thickness rather than to reservoir properties. The neural network had learned, from just seven data points, to suppress exactly the attributes that Lee’s domain knowledge had already flagged as problematic.

This is Mill’s Method of Residues in action: if you know something else causes an observation, subtract it out. And it’s a reminder that domain knowledge and data-driven methods can converge on the same answer when both are applied honestly. I found this deeply satisfying.

What the AI got right—and what it missed

More recently, I revisited this problem using ChatGPT with the Wolfram plugin. I wanted to see if an AI, given just the table and a prompt about Mill’s Methods, could reason its way to the same conclusions.

It did—mechanically. It correctly identified Curvature and Diffraction as the consistent factors among problem wells. It noted that neither attribute alone was a perfect discriminator. It even offered to run logistic regression.

But it missed the interpretive leap. It hedged with phrases like “although there are exceptions” when in fact there were no exceptions to the conjunction rule. And it didn’t articulate the pragmatic framing: that the goal wasn’t to find the true cause, but to build a defensible decision rule under uncertainty.

That framing—the shift from epistemology to operations—required domain knowledge and judgment. The AI could apply Mill’s Methods. It couldn’t tell me why that application was useful here.

Drafting this post, I worked with a different AI—Claude—and found the collaboration more useful in a different way: not for solving the problem, but for reflection. Having to explain the context, the history, the why of my interest helped me articulate what I’d been carrying around in my head for years. Sometimes the value of a thinking partner isn’t in the answers, but in the questions that force you to be clearer.

Why this stuck with me

I’ll be honest: I kept thinking about this problem for years. It became part of a longer arc of engagement with Lee’s work—first the statistics tutorial, then the Wilrich case study, each building on the last.

When I interviewed for a geophysics position (Lee was retiring, and I was a candidate for his role), I mentioned this case study. I pulled out a pen and paper and wrote the entire seven-well table from memory. They seemed impressed—not because memorizing a table is hard, but because it signaled that I’d actually enjoyed thinking about it. That kind of retention only happens when curiosity is real.

I didn’t get the job. The other candidate had more operational experience, and that was the right call. But the process was energizing, and I’m sure that enthusiasm carried into my next opportunity, where I landed happily and stayed for over six years.

I tell this not to brag, but to make a point: intellectual play compounds. You don’t always see the payoff immediately. Sometimes you explore a problem just because it’s interesting—because someone like Lee writes “You can do it, too” and you decide to take him seriously—and it pays dividends in ways you didn’t expect.

The convergence

Three very different approaches—19th-century inductive logic, a simple neural network, and (later) an AI assistant—all pointed to the same answer: Curvature and Diffraction predict drilling problems in this dataset. The AVO attributes are noise, or worse, misleading.

When three methods converge, you can trust the signal. And you can make decisions accordingly.

That’s the real lesson here: rigorous reasoning under uncertainty isn’t about finding the One True Cause. It’s about building defensible heuristics, being honest about what you don’t know, and updating as new data comes in. Mill understood this in 1843. A neural network can learn it from seven wells. And sometimes, so can an AI—with a little help.

I hope you enjoyed this as much as I enjoyed putting it together.

The original case study was presented by Lee Hunt in his CSEG talk “Value thinking: from the classical to the hyper-modern.” The neural network analysis is in my Geoscience_ML_notebook_4. Lee documented the coal correlation issue in Hunt et al., “Precise 3D seismic steering and production rates in the Wilrich tight gas sands of West Central Alberta” (SEG Interpretation, May 2014), and later reflected on confirmation bias as an obstacle to recognizing such issues in “Useful Mistakes, Cognitive Biases and Seismic” (CSEG Recorder, April 2021). My thanks to Lee for the original inspiration, for confirming the geological context, and for sharing the original presentation materials.

References and Links

Hunt, L., 2013, Many correlation coefficients, null hypotheses, and high value: CSEG Recorder, December 2013. Link
Hunt, L., S. Hadley, S. Reynolds, R. Gilbert, J. Rule, M. Kinzikeev, 2014, Precise 3D seismic steering and production rates in the Wilrich tight gas sands of West Central Alberta: SEG Interpretation, May 2014.
Hunt, L., 2021, Useful Mistakes, Cognitive Biases and Seismic: CSEG Recorder, April 2021.
My neural network analysis: Geoscience_ML_notebook_4
My earlier exploration of Lee’s production data: Data exploration in Python: distance correlation and variable clustering
ChatGPT + Wolfram session on Mill’s Methods: Gist

AI/HI Transparency Statement Modified from Brewin http://www.theguardian.com/books/2024/apr/04/why-i-wrote-an-ai-transparency-statement-for-my-book-and-think-other-authors-should-too

Has any text been generated using AI?	Yes
Has any text been improved or corrected using HI?	Yes

Additional context: This post emerged from a conversation with Claude AI (Anthropic). I provided the source materials (a ChatGPT + Wolfram session, a Jupyter notebook, personal history with the problem), direction, and editorial judgment throughout. Claude drafted the post based on these inputs and our discussion of structure, voice, and framing. I reviewed multiple drafts, revised as needed, rewrote some key sections, and made all final decisions about what went to publication. The core analysis (Mill’s Methods, the neural network, the interpretation) was done by me years before this collaboration; the AI’s role was in helping articulate and structure that work for a blog audience.

Keep advancing your Python coding skills

Posted on October 22, 2020 by matteomycarta

In my last post I touched on the topic of continuously improving your geo-computing projects (also take a look at my chapter from the upcoming Software Underground book, 52 things you should know about geocomputing).

However, one aspect I intentionally left out was coding skills, as I was planning to come back to it in a dedicated post, which you are reading right now.

2018 vs 2020 comparison of flag percentage calculation

In the Jupyter notebook I compare the results of seismic inversion from two methods (with or without inversion-tailored noise attenuation) using a custom function to flag poor prediction of the target well log using median/median absolute deviation as a statistic for the error; the results are shown below.

One may just do this visual comparison, but I also included calculations to count the number and percentage of samples that have been flagged for each case. Below is a cell of code from the Jupyter notebook (let’s call it 2020 code) that does just that.

zone_errors_a['flagged samples']=result_a.groupby('zone', sort=False).flag.sum().values
zone_errors_b['flagged samples']=result_b.groupby('zone', sort=False).flag.sum().values

def calc_proportion(dtf):
    """
    function to calculate proportion of flagged samples
    """
    x=dtf.flag
    return round(100 * x.sum()/len(x), 1)

zone_errors_a['proportion (%)']=result_a.groupby('zone',sort=False).apply(calc_proportion).values
zone_errors_b['proportion (%)']=result_b.groupby('zone',sort=False).apply(calc_proportion).values

I am a lot happier with this code than with the original code (circa 2018), which is in the cell below.

zones_a=list(result_a['zone'].unique())
zones_b=list(result_b['zone'].unique())

zone_errors_a['flagged samples']=[result_a.loc[result_a.zone==z,'flag'].sum() for z in zones_a]
zone_errors_b['flagged samples']=[result_b.loc[result_b.zone==z,'flag'].sum() for z in zones_b]

zone_errors_a['proportion (%)']=[round(result_a.loc[result_a.zone==z,'flag'].sum()/len(result_a.loc[result_a.zone==z,'flag'])*100,1) for z in zones_a]
zone_errors_b['proportion (%)']=[round(result_b.loc[result_b.zone==z,'flag'].sum()/len(result_b.loc[result_b.zone==z,'flag'])*100,1) for z in zones_b]

The major differences in the older code are:

I was using unique instead of Pandas’ groupby
I was using list comprehensions to work through the DataFrame, instead of Pandas’ apply and a custom function to calculate the percentages on the entire DataFrame at once.

I find the 2020 code much more tidy and easier to read.
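For what it is worth, the same two numbers can also come out of a single groupby call using named aggregation. The sketch below is not from the notebook: it assumes a tiny made-up `result_a` with only the `zone` and `flag` columns, and relies on the fact that the mean of a 0/1 flag is the flagged proportion.

```python
import pandas as pd

# Made-up stand-in for result_a: one row per sample, flag is 0 or 1
result_a = pd.DataFrame({
    'zone': ['upper', 'upper', 'upper', 'upper', 'lower', 'lower', 'lower'],
    'flag': [1, 0, 0, 1, 1, 0, 0],
})

# One pass over the groups: count of flagged samples and their proportion
summary = (result_a.groupby('zone', sort=False)['flag']
           .agg(flagged_samples='sum', proportion='mean'))
summary['proportion'] = (100 * summary['proportion']).round(1)

print(summary)
```

The result is a small DataFrame indexed by zone, so there is no need for `.values` or for a separate helper function.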

Enter Pandas for Everyone

The above changes happened in but a few hours over two evenings, after having worked through chapters 9 and 10 of Pandas for Everyone by Daniel Chen, a very accessible read for all aspiring data scientists, which I highly recommend (also, watch Daniel’s fully-packed 2019 Pycon tutorial).

And before you ask: no, you do not get the Agile Scientific sticker with the book, I am sorry.

🙂

Comparison of 2016 vs 2020 code snippets from the 2016 SEG Machine Learning contest

A second example is of code used to calculate the first and second derivatives for all geophysical logs from the wells in the 2016 SEG Machine Learning contest.

The two cells of code below do exactly the same thing: loop through the wells and, for each one in turn, loop through the logs, calculate the derivatives, add them to a temporary Pandas DataFrame, then concatenate into a single output DataFrame. In this case, the only difference is the move from unique to groupby.

I use the %%timeit cell magic to compare the runtimes for the two cells.

2016 code

%%timeit
# for training data
# calculate all 1st and 2nd derivative for all logs, for all wells
train_deriv_df = pd.DataFrame()             # final dataframe

for well in train_data['Well Name'].unique():        # for each well
    new_df = pd.DataFrame() # make a new temporary dataframe
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']: # for each log
        # calculate and write to temporary dataframe
        new_df[str(log) + '_d1'] = np.array(np.gradient(train_feat_df[log][train_feat_df['Well Name'] == well]))
        new_df[str(log) + '_d2'] = np.array(np.gradient(np.gradient(train_feat_df[log][train_feat_df['Well Name'] == well])))
         
    # append all rows of temporary dataframe to final dataframe          
    train_deriv_df = pd.concat([train_deriv_df, new_df])

86 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2020 code

%%timeit
# for training data
# calculate all 1st and 2nd derivative for all logs, for all wells
train_deriv_df = pd.DataFrame() # final dataframe

for _, data in train_feat_df.groupby('Well Name'): # for each well        
    new_df = pd.DataFrame()                        # make a new temporary dataframe
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']: # for each log
        # calculate and write to temporary dataframe 
        new_df[str(log) + '_d1'] = np.gradient(data[log])
        new_df[str(log) + '_d2'] = np.gradient(np.gradient(data[log]))

    # append all rows of temporary dataframe to final dataframe          
    train_deriv_df = pd.concat([train_deriv_df, new_df])

52.3 ms ± 353 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We go down to 52.3 ms from 86 ms, which is a modest improvement, but certainly the code is more compact and a whole lot lighter to read (i.e. more pythonic, or pandaish if you prefer): I am happy!
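One further variation, which I have not timed on the contest data, is to collect the per-well DataFrames in a list and concatenate once at the end instead of growing the output inside the loop. The sketch below assumes a small made-up `train_feat_df` with two logs only, so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for train_feat_df
train_feat_df = pd.DataFrame({
    'Well Name': ['W1'] * 4 + ['W2'] * 3,
    'GR': [10.0, 12.0, 15.0, 19.0, 50.0, 48.0, 44.0],
    'PE': [3.0, 3.5, 3.5, 4.0, 2.0, 2.5, 3.5],
})
logs = ['GR', 'PE']

frames = []                                      # one small DataFrame per well
for _, data in train_feat_df.groupby('Well Name'):
    new_df = pd.DataFrame(index=data.index)      # keep the original row labels
    for log in logs:
        d1 = np.gradient(data[log].to_numpy())   # first derivative
        new_df[log + '_d1'] = d1
        new_df[log + '_d2'] = np.gradient(d1)    # second derivative
    frames.append(new_df)

train_deriv_df = pd.concat(frames)               # single concatenation

print(train_deriv_df)
```

Unlike the cells above, this version keeps the row labels of the input, which makes it straightforward to join the derivatives back onto the original DataFrame.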

As an aside, if you want to know more about timing code execution, see section 1.07 from Jake VanderPlas’ outstanding Python Data Science Handbook, which I also cannot recommend enough (and do yourself a favor: watch his series Reproducible Data Analysis in Jupyter).

By the way, below I show the notebook code comparison generated using the nbdiff-web option from the awesome nbdime library, a recent discovery.

Geoscience Machine Learning bits and bobs – data completeness

Posted on September 19, 2020 by matteomycarta

2016 Machine learning contest – Society of Exploration Geophysicists

In a previous post I showed how to use pandas.isnull to find out, for each well individually, if a column has any null values, and sum to get how many, for each column. Here is one of the examples (with more modern, pandaish syntax compared to the example in the previous post):

for well, data in training_data.groupby('Well Name'):
    print(well)
    print(data.isnull().values.any())
    print(data.isnull().sum(), '\n')

Simple and quick, the output showed me that, for example, the well ALEXANDER D is missing 466 samples from the PE log:

ALEXANDER D
True
Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64
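If a single table is preferred over one printout per well, the null counts can be grouped by well in one expression. This is a sketch on a small made-up `training_data` (two wells, two logs), not the contest data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for training_data
training_data = pd.DataFrame({
    'Well Name': ['A', 'A', 'A', 'B', 'B'],
    'GR': [55.0, 60.0, np.nan, 70.0, 72.0],
    'PE': [np.nan, np.nan, 3.1, 2.9, 3.0],
})

# Boolean null mask, grouped by the well name, summed to get counts per column
null_counts = (training_data.drop(columns='Well Name')
               .isnull()
               .groupby(training_data['Well Name'])
               .sum())

print(null_counts)
```

Each row is a well and each column a log, so wells with missing logs stand out at a glance.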

A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function from the missingno library. With the code below I can turn each well into a Pandas DataFrame on the fly, then a missingno matrix plot.

import numpy as np
import matplotlib.pyplot as plt
import missingno as msno

for well, data in training_data.groupby('Well Name'):
    msno.matrix(data, color=(0., 0., 0.45))
    fig = plt.gcf()
    fig.set_size_inches(20, np.round(len(data)/100)) # height of the plot for each well reflects well length
    axes = fig.get_axes()
    axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')

In each of the following plots, the sparkline at the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset. This to me is a much more compelling and informative way to inspect log data, as it shows the depth range where data is missing. The well length is also annotated on the bottom left, from which I learned that Recruit F9 is much shorter than the other wells. To reinforce this, I introduced a line to modify each plot so that its height reflects the well length. I really like it!

2020 Machine Predicted Lithology – FORCE

Since I am taking part in this year’s FORCE Machine Predicted Lithology challenge, I decided to take the above visualization up a notch. Using ipywidgets’ interactive and similar logic (in this case data['WELL'].unique()), the tool below allows browsing wells with a Select widget and checking the completeness of the chosen well’s curves on the fly. You can try the tool in this Jupyter notebook.

In a second Jupyter notebook on the other hand, I used missingno matrix to make a quick visual summary plot of the entire dataset completion, log by log (all wells together):

Then, to explore the data completion in more depth, below I also plotted the library’s dendrogram plot. As explained in the library’s documentation: “The dendrogram uses a hierarchical clustering algorithm (courtesy of Scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.”

I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they couldn’t be used to guide the strategy to deal with missing data, together with domain knowledge from petrophysics.

Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thoughts are that this may suggest trying to predict missing values in a sequential fashion rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from top right, I am thinking of testing use of GR to first predict missing values in RDEP, then both to predict missing values in RMED, then DTC. Then add CALI and use all logs completed so far to predict RHOB, and so on.
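To make the sequential idea concrete, here is a minimal sketch of the first two steps. Everything in it is an assumption made for illustration: the logs are synthetic, the missing samples are placed at random, and an ordinary least-squares fit stands in for whatever predictive model one would actually test. The helper `fill_from` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic logs (made-up relationships), with gaps punched in RDEP and RMED
rng = np.random.default_rng(0)
n = 200
gr = rng.normal(75, 20, n)
rdep = 0.05 * gr + rng.normal(0, 0.2, n)
rmed = 0.8 * rdep + 0.01 * gr + rng.normal(0, 0.2, n)
logs = pd.DataFrame({'GR': gr, 'RDEP': rdep, 'RMED': rmed})
logs.loc[rng.choice(n, 30, replace=False), 'RDEP'] = np.nan
logs.loc[rng.choice(n, 60, replace=False), 'RMED'] = np.nan


def fill_from(df, target, predictors):
    """Fill NaNs in `target` using a linear least-squares fit on `predictors`.

    The predictors are assumed to be complete (no NaNs).
    """
    known = df[target].notna().to_numpy()
    # Design matrix: predictor columns plus a constant term
    X = np.column_stack([df[predictors].to_numpy(), np.ones(len(df))])
    coef, *_ = np.linalg.lstsq(X[known], df[target].to_numpy()[known], rcond=None)
    out = df.copy()
    out.loc[~known, target] = X[~known] @ coef
    return out


step1 = fill_from(logs, 'RDEP', ['GR'])            # GR -> RDEP
step2 = fill_from(step1, 'RMED', ['GR', 'RDEP'])   # GR + completed RDEP -> RMED

print(step2.isnull().sum())
```

Each step only fills the gaps and leaves measured samples untouched, so the completed log from one step can safely be used as a predictor in the next.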

Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and cross-validation error first using only rows without NaNs, then replacing all NaNs with the mean, then comparing separately this sequential log-completing strategy with an all-in-one strategy.

Busting bad colormaps with Python and Panel

Posted on September 17, 2020 by matteomycarta

I have not done much work with, or written here on the blog about colormaps and perception in quite some time.

Last spring, however, I decided to build a web-based app to show the effects of using bad colormaps. This stemmed from two needs: first, to further my understanding of Panel, after working through the awesome tutorial by James Bednar, Panel: Dashboards (at PyData Austin 2019); and second, to enable people to explore interactively the effects of bad colormaps on their perception, and consequently on their ability to interpret faults on a 3D seismic horizon.

I introduced the app at the Transform 2020 virtual subsurface conference, organized by Software Underground last June. Please watch the recording of my lightning talk as it explains in detail the machinery behind it.

I am writing this post in part to discuss some changes to the app. Here’s how it looks right now:

The most notable change is the switch from one drop-down selector to two drop-down selectors, in order to support both the Matplotlib collection and the Colorcet collection of colormaps. Additionally, the app has since been featured in the resource list on the Awesome Panel site, an achievement I am really proud of.

You can try the app yourself by either running the notebook interactively with Binder, by clicking on the button below:

or, by copying and pasting this address into your browser:

https://mybinder.org/v2/gh/mycarta/Colormap-distorsions-Panel-app/master?urlpath=%2Fpanel%2FDemonstrate_colormap_distortions_interactive_Panel

Let’s look at a couple of examples of insights I gained from using the app. For those that jumped straight to this example, the top row shows:

the horizon, plotted using the benchmark grayscale colormap, on the left
the horizon intensity, derived using skimage.color.rgb2gray, in the middle
the Sobel edges detected on the intensity, on the right

and the bottom row, shows:

the horizon, plotted using the Matplotlib gist_rainbow colormap, on the left
the intensity of the colormapped horizon, in the middle. This is possible thanks to a function that makes a figure (but does not display it), plots the horizon with the specified colormap, then saves the plot in the canvas to an RGB numpy array
the Sobel edges detected on the colormapped intensity, on the right
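The mechanism behind the two rows can be illustrated with NumPy alone. The sketch below is a simplified stand-in, not the app's code: it converts RGB to intensity with the luminance weights documented for skimage.color.rgb2gray, and uses a bare-bones, unnormalized Sobel operator on interior pixels (skimage's own filter scales its output differently). The two-tone test images and the helper names are made up for the example.

```python
import numpy as np

# Luminance weights documented for skimage.color.rgb2gray
WEIGHTS = np.array([0.2125, 0.7154, 0.0721])


def rgb_to_intensity(rgb):
    """Collapse an (H, W, 3) RGB array with values in [0, 1] to intensity."""
    return rgb[..., :3] @ WEIGHTS


def sobel_magnitude(img):
    """Unnormalized Sobel gradient magnitude on the interior of a 2D array."""
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return np.hypot(gx, gy)


def two_tone(left, right, shape=(5, 6)):
    """Image whose left half has one RGB color and whose right half has another."""
    img = np.empty(shape + (3,))
    half = shape[1] // 2
    img[:, :half] = left
    img[:, half:] = right
    return img


red, gray = (1.0, 0.0, 0.0), (0.2125, 0.2125, 0.2125)
blue, yellow = (0.0, 0.0, 1.0), (1.0, 1.0, 0.0)

# Red next to a gray of equal intensity: an obvious color edge, no intensity edge
same_intensity = sobel_magnitude(rgb_to_intensity(two_tone(red, gray)))
# Blue next to yellow: a large intensity step, hence a strong Sobel response
strong_contrast = sobel_magnitude(rgb_to_intensity(two_tone(blue, yellow)))

print(same_intensity.max())
print(strong_contrast.max())
```

A boundary between two colors of equal intensity produces essentially no Sobel response, which is the same reason features can disappear where a colormap flattens intensity.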

I think the effects of this colormap are already apparent when comparing the bottom left plot to the top left plot. However, simulating perception can be quite revealing for those that have not considered these effects before. The intensity in the bottom middle plot is very washed out in the areas corresponding to green color in the bottom left, and as a result, many of the faults are not visible any more, or only with much difficulty, which is demonstrated by the Sobel edges in the bottom right.

And if you are not quite convinced yet, I have created these hill-shaded maps, using Matt Hall’s delightful function from this notebook (and check his blog post):

Below is another example, using the Colorcet cet_rainbow, which is one of Peter Kovesi’s perceptually uniform colormaps. I use many of Peter’s colormaps, but never used this one, because I use my own perceptual rainbow, which does not have a fully saturated yellow or a fully saturated red. I think the app demonstrates that, even though they are more subtle, this rainbow still introduces some artifacts. The yellow colour creates narrow flat bands, visible in the intensity and Sobel plots, and indicated by yellow arrows; the red colour is also bad as usual, causing an artificial decrease in intensity (magenta arrows).

Be a geoscience and data science detective

Featured

Posted on September 16, 2020 by matteomycarta

Introduction

These days everyone talks about data science. But here’s a question: if you are a geoscientist, and like me you have some interest in data science (that is, doing more quantitative and statistical analyses with data), why choose between the two? Do both… always! Your domain knowledge is an indispensable condition, and so is an attitude of active curiosity (for me, an even more important condition). So, my advice for aspiring geoscientists and data scientists is: “yes, do some course work, read some articles, maybe a book, get the basics of Python or R, if you do not know a programming language” but then jump right ahead into doing! But – please! – skip the Titanic, Iris, Cars datasets, or any other data from your MOOC or that many have already looked at! Get some data that is interesting to you as geoscientist, or put it together yourself.

Today I will walk you through a couple of examples; the first one, I presented as a lightning talk at the Transform 2020 virtual conference organized by Software Underground. This project had in fact begun at the 2018 Geophysics Sprint (organized by Agile Scientific ahead of the Annual SEG conference) culminating with this error flag demo notebook. Later on, inspired by watching the Stanford University webinar How to be a statistical detective, I decided to resume it, to focus on a more rigorous approach. And that leads to my last bit of advice, before getting into the details of the statistical analysis: keep on improving your geo-computing projects (if you want more advice on this aspect, take a look at my chapter from the upcoming Software Underground book, 52 things you should know about geocomputing): it will help you not only showcase your skills, but also your passion and motivation for deeper and broader understanding.

First example: evaluate the quality of seismic inversion from a published article

In this first example, the one presented at Transform 2020, I wanted to evaluate the quality of seismic inversion from an article published in the November 2009 CSEG Recorder, Inversion Driven Processing. In the article, the evaluation was done by the authors at a blind well location, but only qualitatively, as illustrated in Figure 5, shown below for reference. In the top panel (a) the evaluation is for the inversion without additional processing (SPNA, Signal Protected Noise Attenuation); in the bottom panel (b) the evaluation is for the inversion with SPNA. On the right side of each panel the inverted seismic trace is plotted against the upscaled impedance well log (calculated by multiplying the well density log and the well velocity log from compressional sonic); on the left, the upscaled impedance log is inserted in a seismic impedance section as a colored trace (at the location of the well) using the same color scale and range used for the impedance section.

Figure 5 caption: Acoustic impedance results at the blind well for data without (a) and with (b) SPNA. The figure shows a 200 ms window of inverted seismic data with well B, the blind well, in the middle on the left, along with acoustic impedance curves for the well (red) and inverted seismic (blue) on the right. The data with SPNA shows a better fit to the well, particularly over the low frequencies.

What the authors reported in the figure caption is the extent to which the evaluation was discussed in the paper; unfortunately it is not backed up in any quantitative way, for example by comparing a score, such as R^2, for the two methods. Please notice that I am not picking on this paper in particular, which in fact I quite liked, but I am critical of the lack of supporting statistics, and wanted to supplement the paper with my own.

In order to do that, I hand-digitized the logs and inversion traces from the figure above, then interpolated to regularly-sampled time intervals (by the way: if you are interested in a free tool to digitize plots, check out WebPlotDigitizer).

My plan was to split my evaluation in an upper and lower zone, but rather than using the seismically-picked horizon, I decided to add a fake top at 1.715 seconds, where I see a sharp increase in impedance in Figure 5. This was an arbitrary choice on my part of a more geological horizon, separating the yellow-red band from the green blue band in the impedance sections. The figure below shows all the data in a Matplotlib figure:

The first thing I did then, was to look at the Root Mean Square Error in the upper and lower zone obtained using the fake top. They are summarized in the table below:

Based on the RMSE, it looks like case b, the inversion with Signal Protected Noise Attenuation applied on the data, is a better result for the Upper zone, but not for the Lower one. This result is in agreement with my visual comparison of the two methods.
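For reference, a zone-by-zone RMSE of this kind takes only a few lines of NumPy. The sketch below uses made-up numbers rather than the digitized curves, and assumes that the sample falling exactly on the fake top at 1.715 s is assigned to the lower zone.

```python
import numpy as np


def rmse(pred, actual):
    """Root Mean Square Error between two equally sampled curves."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))


# Made-up values for illustration (not the digitized data)
time = np.array([1.700, 1.705, 1.710, 1.715, 1.720, 1.725])   # two-way time, s
actual = np.array([5.0, 5.0, 5.0, 7.0, 7.0, 7.0])             # well impedance
pred = np.array([5.0, 6.0, 4.0, 7.0, 9.0, 5.0])               # inverted impedance

top = 1.715                 # fake top separating the two zones
upper = time < top          # samples above the top
lower = ~upper              # samples at or below the top

rmse_upper = rmse(pred[upper], actual[upper])
rmse_lower = rmse(pred[lower], actual[lower])

print(round(rmse_upper, 3), round(rmse_lower, 3))
```

Running the same two lines for each inversion case gives the four numbers needed for a table of this kind.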

But let’s dig a bit deeper. After looking at RMSE, I used the updated version of the error_flag function, which I first wrote at the 2018 Geophysics Sprint, listed below:

import numpy as np
import pandas as pd
from scipy import stats


def error_flag(pred, actual, stat, dev=1.0, method=1):
    """Calculate the difference between a predicted and an actual curve
    and return a curve flagging large differences based on a user-defined distance
    (in deviation units) from either the mean difference or the median difference.

    Matteo Niccoli, October 2018. Updated in May 2020.

    Parameters:
        pred : array
            predicted log array
        actual : array
            original log array
        stat : {'mean', 'median'}
            The statistics to use. The following options are available:
                - mean: uses numpy.mean for the statistic,
                and numpy.std for dev calculation
                - median: uses numpy.median for the statistic,
                and scipy.stats.median_abs_deviation (MAD) for dev calculation
        dev : float, optional
            the number of deviations to use. The default is 1.0
        method : int {1, 2, 3}, optional
            The error method to use. The following options are available
            (default is 1):
                1: difference between curves larger than the statistic
                   (mean or median difference) plus dev deviations
                2: curve slopes have opposite sign (after a 3-sample window smoothing)
                3: curve slopes of opposite sign OR difference larger than
                   the statistic plus dev deviations
    Returns:
        flag : array
            The error flag array
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)

    flag = np.zeros(len(pred))
    err = np.abs(pred - actual)

    if stat == 'mean':
        err_stat = np.mean(err)
        err_dev = np.std(err)
    elif stat == 'median':
        err_stat = np.median(err)
        # median_absolute_deviation was removed from SciPy; scale='normal'
        # reproduces its default scaling (about 1.4826 * MAD)
        err_dev = stats.median_abs_deviation(err, scale='normal')
    else:
        raise ValueError("stat must be 'mean' or 'median'")

    # 3-sample running sum (a scaled moving average); only the sign of its slope is used
    pred_sm = pd.Series(np.convolve(pred, np.ones(3), 'same'))
    actual_sm = pd.Series(np.convolve(actual, np.ones(3), 'same'))
    ss = np.sign(pred_sm.diff().fillna(pred_sm))
    ls = np.sign(actual_sm.diff().fillna(actual_sm))

    large_err = err > (err_stat + dev * err_dev)
    # True where slope signs cancel out (also where both slopes are exactly zero)
    opposite = ((ss + ls) == 0).to_numpy()

    if method == 1:
        flag[large_err] = 1
    elif method == 2:
        flag[opposite] = 1
    elif method == 3:
        flag[np.logical_or(large_err, opposite)] = 1
    return flag

I believe this new version is greatly improved because:

Users now can choose between mean/standard deviation and median/median absolute deviation as a statistic for the error. The latter is more robust in the presence of outliers
I added a convolutional smoother prior to the slope calculation, so as to make it less sensitive to noisy samples
I expanded and improved the docstring

The figure below uses the flag returned by the function to highlight areas of poorer inversion results, which I identified by passing very restrictive parameters to the function:

using the median, with 0.5 median absolute deviations as the threshold to trigger the flag
combining the above with checking for the slope sign

I also wrote short routines to count the number and percentage of samples that have been flagged, for each result, which are summarized in the table below:

The error flag method is in agreement with the RMSE result: case b, the inversion on Signal Protected Noise Attenuated data, is a better result for the Upper zone, but for the Lower zone the inversion without SPNA is the better one. Very cool!

But I was not satisfied yet. I was inspired to probe even deeper after a number of conversations with my friend Thomas Speidel, and reading the chapter on Estimation from Computational and Inferential Thinking (UC Berkeley). Specifically, I was left with this question in mind: “Can we be confident about those two ‘is better’ in the same way?”
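The counting step is simple enough to sketch in a few lines. This is an illustrative version only, assuming the flag is a 0/1 array like the one returned by error_flag; the helper name flag_summary and the example array are hypothetical.

```python
import numpy as np


def flag_summary(flag):
    """Return (count, percentage) of flagged samples in a 0/1 flag array."""
    flag = np.asarray(flag)
    count = int(np.count_nonzero(flag))
    return count, 100.0 * count / flag.size


# Hypothetical flag curve with 3 flagged samples out of 10
flag = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0])
count, pct = flag_summary(flag)
print(f"{count} flagged samples ({pct:.1f}%)")
```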

This question can be answered with a bootstrapped Confidence Interval for the proportions of flagged samples, which I do below using the code in the book, with some modifications and some other tools from the datascience library. The results are shown below. The two plots, one for the Upper and one for the Lower zone, show the distribution of bootstrap flagged proportions for the two inversion results, with SPNA in yellow, and without SPNA in blue, respectively, and the Confidence Intervals in cyan and brown, respectively (the CI upper and lower bounds are also added to the table, for convenience).

By comparing the amount (or paucity) of overlap between the distributions (and between the confidence intervals) in the two plots, I believe I can be more confident in the conclusion drawn for the Lower zone, which is that the inversion on data without SPNA is better (a lower proportion of flagged errors), as there is far less overlap.
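For readers who want to try the idea without the datascience library, here is a rough sketch of a percentile bootstrap for the proportion of flagged samples, written with NumPy only. The function name, the number of resamples, and the example flag array are assumptions for illustration, and the sketch treats the samples as independent draws, which is itself an assumption for a well log.

```python
import numpy as np


def bootstrap_proportion_ci(flag, n_boot=5000, level=95, seed=0):
    """Percentile bootstrap confidence interval for a flagged proportion.

    Returns the bootstrap proportions and the lower and upper CI bounds.
    """
    flag = np.asarray(flag, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample (with replacement) of the original flag curve
    idx = rng.integers(0, flag.size, size=(n_boot, flag.size))
    props = flag[idx].mean(axis=1)
    tail = (100 - level) / 2
    lower, upper = np.percentile(props, [tail, 100 - tail])
    return props, float(lower), float(upper)


# Hypothetical flag curve: 30 flagged samples out of 100
flag = np.r_[np.ones(30), np.zeros(70)]
props, lo, hi = bootstrap_proportion_ci(flag)
print(f"95% CI for the flagged proportion: [{lo:.2f}, {hi:.2f}]")
```

Calling the function once per inversion result and per zone gives the distributions and bounds to compare; the less two distributions overlap, the more confident one can be that the difference in flagged proportion is not a sampling accident.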

I am very pleased with these results. Of course, there are some caveats to keep in mind, mainly that:

I may have introduced small, but perhaps significant, errors with hand digitizing
I chose a statistical measure (the median and median absolute deviation) over a number of possible ones
I chose an arbitrary geologic reference horizon without a better understanding of the reservoir and the data, just the knowledge from the figure in the paper

However, I am satisfied with this work, because it points to a methodology that I can use in the future. And I am happy to share it! The Jupyter notebook is available on GitHub.

Second example: evaluate regression results from a published article

My second, more recent example, is shorter but no less interesting, in my opinion. It is a case study correlating inversion data to petrophysical estimates of porosity-height in the Doig and Montney Formations in Western Canada (from the paper Tight gas geophysics: AVO inversion for reservoir characterization, Close et al. CSEG Recorder, May 2010, an article which I enjoyed very much reading).

The authors indicated that Vp/Vs and/or Poisson’s ratio maps from seismic inversion are good indicators of porosity in the Lower Doig and Upper Montney reservoirs in the wells used in their study, so it was reasonable to try to predict Phi-H from Vp/Vs via regression. The figure below, from the article, shows one such Vp/Vs ratio map and the Vp/Vs vs. Phi-H cross-plot for 8 wells.

Figure 7 caption: a) Map of median Vp/Vs ratio value and porosity-height from 8 wells through the Lower Doig and Upper Montney. The red arrows highlight wells with very small porosity-height values and correspond in general to areas of higher Vp/Vs ratio. The blue arrow highlights a well at the edge of the seismic data where the inversion is adversely affected by decreased fold. The yellow line is the approximate location of a horizontal well where micro-seismic data were recorded during stimulation. b) Cross-plot of porosity-height values as shown in (a) against Vp/Vs extracted from the map. The correlation co-efficient of all data points (blue line) of 0.7 is improved to 0.9 (red line) by removing the data point marked by the blue arrow which again corresponds to the well near the edge of the survey (a).

They also show in the figure that by removing one of the data points, corresponding to a well near the edge of the survey (where the seismic inversion result is presumably not as reliable, due to lower offset and azimuth coverage), the correlation co-efficient is improved from 0.7 to 0.9 (red line).

So, the first thing I set out to do was to reproduce the crossplot. I again hand-digitized the porosity-height and Vp/Vs pairs in the cross-plot using WebPlotDigitizer. However, I switched the axes, which seems more natural to me since the objective of a regression effort would be to predict Phi-h as the dependent variable, at the location of yet-to-be-drilled wells, given Vp/Vs from seismic inversion. I also labelled the wells using their row index, after having loaded them in a Pandas DataFrame. And I used Ordinary Least Squares regression from the statsmodels library twice: once with all data points, the second time after removal of the well labelled as 5 in my plot above.

So, I was able to reproduce the analysis from the figure. I think removing an outlier with insight from domain knowledge (the observation that a poorer inversion result at this location is a reasonable explanation for the deviation from trend) is a legitimate choice. However, I would like to dig a bit deeper, to back up the decision with other analyses and tests, and to show how one might do it with their own data.

The first thing to look at is an Influence plot, which is a plot of the residuals, scaled by their standard deviation, against the leverage, for each observation. Influence plots are useful to distinguish high-leverage observations from outliers and are one of the standard Regression plots in statsmodels, so we get the next figure almost for free (with minor modifications to the default example). Here it is below, together with the OLS regression result, with all data points.
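A minimal sketch of the two fits is shown below. It uses numpy.polyfit rather than statsmodels to keep the example dependency-light, and the nine (Vp/Vs, Phi-h) pairs are hypothetical values made up for illustration, with row 5 deliberately placed off trend; they are not the digitized values from the article.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus Pearson r."""
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)


# Hypothetical (Vp/Vs, Phi-h) pairs for 9 wells; row 5 is a deliberate outlier
vpvs = np.array([1.50, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85, 1.90])
phih = np.array([2.4, 1.8, 1.6, 1.5, 1.3, 2.2, 1.1, 0.9, 0.7])

# Fit 1: all wells
slope_all, intercept_all, r_all = fit_line(vpvs, phih)

# Fit 2: drop the well labelled 5
keep = np.arange(vpvs.size) != 5
slope_wo, intercept_wo, r_wo = fit_line(vpvs[keep], phih[keep])

print(f"All wells:       slope={slope_all:.2f}, r={r_all:.2f}")
print(f"Without well 5:  slope={slope_wo:.2f}, r={r_wo:.2f}")
```

With statsmodels the equivalent would be two OLS fits on the same two subsets; the point of the sketch is only the before-and-after comparison.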

From the Influence plot it is very obvious that the point labelled as zero has high leverage (but not a high normalized residual). This is not a concern, because a high-leverage point that sits close to the trend does not alter the regression model much. On the other hand, the point labelled as 5 has a very high normalized residual. This point is an outlier and it will influence the regression line, reducing the R^2 and correlation coefficient. This analysis is a robust way to legitimize removing that data point.

Next I ran some inferential tests. As I’ve written in an earlier notebook on Data loading, visualization, significance testing, I find the critical r very useful in the context of bivariate analysis. The critical r is the value of the correlation coefficient at which you can rule out chance as an explanation for the relationship between variables observed in the sample, and I look at it in combination with the confidence interval of the correlation coefficient.

The two plots below display, in dark blue and light blue respectively, the upper and lower confidence interval bounds, as the correlation coefficient r varies between 0 and 1 (x axis). These two bounds will change with a different number of wells (they will get closer with more wells, and farther apart with fewer wells). The lower bound intersects the x axis (y=0) at a value equal to the critical r (white dot). The green dots highlight the actual confidence interval for a specific correlation coefficient chosen, in this case 0.73 with 9 wells, and 0.9 with 8 wells.
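To make the two axes of an influence plot concrete, here is a small NumPy sketch that computes leverage (the diagonal of the hat matrix) and internally studentized residuals for a straight-line fit. The function name and the nine data pairs are hypothetical, and statsmodels offers both internally and externally studentized residuals, so its numbers may differ slightly from this simplified version.

```python
import numpy as np


def leverage_and_studentized_residuals(x, y):
    """Leverage and internally studentized residuals for y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS coefficients
    resid = y - X @ beta
    hat = X @ np.linalg.inv(X.T @ X) @ X.T           # hat (projection) matrix
    leverage = np.diag(hat)
    n, p = X.shape
    s2 = resid @ resid / (n - p)                     # residual variance estimate
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    return leverage, student


# Hypothetical (Vp/Vs, Phi-h) pairs: row 0 is far from the other x values,
# row 5 is far from the trend
vpvs = np.array([1.50, 1.65, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85, 1.90])
phih = np.array([2.4, 1.8, 1.6, 1.5, 1.3, 2.2, 1.1, 0.9, 0.7])

leverage, student = leverage_and_studentized_residuals(vpvs, phih)
for i, (h, r) in enumerate(zip(leverage, student)):
    print(f"well {i}: leverage={h:.2f}, studentized residual={r:+.2f}")
```

In this made-up dataset the pattern mirrors the one described above: one well stands out on the leverage axis only, another on the residual axis only.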
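One common way to compute a critical r, sketched here under the assumption of a two-tailed t-test on the Pearson correlation with n - 2 degrees of freedom, is to convert the critical t value back to a correlation. The function name is my own for illustration and may differ from what the notebook uses.

```python
import math

from scipy import stats


def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant at level alpha (two-tailed) for n pairs."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Invert t = r * sqrt(df / (1 - r^2)) for r
    return t_crit / math.sqrt(t_crit ** 2 + df)


for n_wells in (9, 8):
    print(f"{n_wells} wells: critical r = {critical_r(n_wells):.2f}")
```

Under these assumptions the sketch gives values that round to 0.67 for 9 wells and 0.71 for 8 wells, and it shows why the threshold rises when a well is removed: fewer degrees of freedom demand a stronger sample correlation.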

By the way: these plots are screen captures from the interactive tool I built taking advantage of the Jupyter interactive functionality (ipywidgets). You can try the tool by running the Jupyter notebook.

With 9 wells and CC=0.73, the critical r of 0.67 tells us that, at a 95% confidence level (alpha = 0.05), we need a correlation coefficient of at least 0.67 in the sample (the 9 wells drilled) to be able to confidently say that there is correlation in the population (e.g. any well, future wells). However, the confidence interval is quite broad, ranging between 0.13 and 0.94 (you can get these numbers by running confInt(0.73, 9) in a cell).

With 8 wells (having removed the outlier) and CC=0.9, the critical r is now 0.71, meaning that the requirement for rejecting the null hypothesis (there is no association between Vp/Vs and Phi-H) is now a bit higher. However, with the CC increased to 0.9, and only one well fewer, the confidence interval now ranges from 0.53 to 0.98, hence our confidence is greatly increased.
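The notebook's confInt function is not reproduced here. The sketch below assumes the standard Fisher z-transformation for the confidence interval of a Pearson correlation, which depends only on r and the number of observations; the function name pearson_ci is hypothetical.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                       # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)             # standard error in z space
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


for r, n in ((0.73, 9), (0.9, 8)):
    lo, hi = pearson_ci(r, n)
    print(f"r={r}, n={n}: CI = [{lo:.2f}, {hi:.2f}]")
```

Under this assumption the bounds round to the intervals quoted above, 0.13 to 0.94 for 9 wells and 0.53 to 0.98 for 8 wells.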

This second analysis also corroborates the choice of removing the outlier data point. One thing worth mentioning before moving on to the next test is that these confidence interval bounds are the expected population ones, based on the sample correlation coefficient and the number of observations; the data itself was not used. Of course, I could also have calculated, and shown you, the OLS regression confidence intervals, as I have done in this notebook (with a different dataset).

My final test involved using the distance correlation (dcor.distance_correlation) and p-value (dcor.independence.distance_covariance_test) from the dcor library. I have written before in a blog post how much I like the distance correlation, because it does not assume a linear relationship between variables, and because a distance correlation of zero does mean that there is no dependence between those two variables (contrary to Pearson and Spearman). In this case the relationship seems to be linear, but it is still a valuable test to compare DC and p-value before and after removing the outlier. Below is a summary of the test:

All data points:
D.C. =  0.745
p-value =  0.03939

Data without outlier:
D.C. =  0.917
p-value =  0.0012

The distance correlation values are very similar to the correlation coefficients from OLS, again going up once the outlier is removed. But, more interestingly, with all data points, a p-value of about 0.04 is very close to the alpha of 0.05, whereas once the outlier is removed, the p-value goes down by more than an order of magnitude, to below 0.002 (the exact p-values vary slightly from run to run, because the test is based on resampling). This again backs up the decision to remove the outlier data point.
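For readers curious about what the distance correlation does under the hood, here is a rough NumPy sketch of the sample distance correlation together with a simple permutation p-value. It is not the dcor implementation: the function names, the number of permutations, and the test data are assumptions for illustration, and the numbers will not match the summary above.

```python
import numpy as np


def _centered_distances(v):
    """Pairwise distance matrix, double-centered (row, column and grand means)."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(len(v), -1)
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between two equally long samples."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = max((a * b).mean(), 0.0)        # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Share of shuffled-y distance correlations at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = distance_correlation(x, y)
    hits = sum(
        distance_correlation(x, rng.permutation(y)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Hypothetical check: a perfectly linear relationship
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
dc_linear = distance_correlation(x, y)
p_linear = permutation_pvalue(x, y)
print(f"D.C. = {dc_linear:.3f}, p-value = {p_linear:.4f}")
```

Because the p-value comes from random shuffles, it changes a little with the seed and the number of permutations, which is worth keeping in mind when comparing p-values across runs.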
