Geoscience Machine Learning bits and bobs – data completeness

2016 Machine learning contest – Society of Exploration Geophysicists

In a previous post I showed how to use pandas.isnull to find out, for each well individually, if a column has any null values, and sum to get how many, for each column. Here is one of the examples (with more modern, pandaish syntax compared to the example in the previous post:

for well, data in training_data.groupby('Well Name'): 
print (data.isnull().values.any())
print (data.isnull().sum(), '\n')

Simple and quick, the output showed met that  – for example – the well ALEXANDER D is missing 466 samples from the PE log:

Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64

A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function form the missingno library. With the code below I can turn each well into a Pandas DataFrame on the fly, then a missingno matrix plot.

for well, data in training_data.groupby('Well Name'): 

msno.matrix(data, color=(0., 0., 0.45)) 
fig = plt.gcf()
fig.set_size_inches(20, np.round(len(data)/100)) # heigth of the plot for each well reflects well length 
axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')

I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they couldn’t be used to guide the strategy to deal with missing data, together with domain knowledge from petrophysics.

Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thoughts are that this may suggest trying to predict missing values in a sequential fashion rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from top right, I am thinking of testing use of GR to first predict missing values in RDEP, then both to predict missing values in RMED, then DTC. Then add CALI and use all logs completed so far to predict RHOB, and so on.

Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and crossvalidation error first using only non NaN rows, then replace all NANs with mean, then compare separately this sequential log completing strategy with an all-in one strategy.

Upcoming book: 52 things you should know about Geocomputing

I am very excited to write that, after a successful second attempt at collecting enough essays, the book 52 Things You Should Know About Geocomputing, by Agile Libre, is very likely to become a reality.

*** July 2020 UPDATE ***

This project got a much needed boost during the hackathon at the 2020 Transform virtual event. Watch the video recording on the group YouTube channel here.


In one of the three chapter I submitted for this book, Prototype colourmaps for fault interpretation, I talk about building a widget for interactive generation of grayscale colourmaps with sigmoid lightness. The whole process is well described in the chapter and showcased in the accompanying GitHub repo.

This post is an opportunity for me to reflect on how revisiting old projects is very important. Indeed, I consider it an essential part of how I approach scientific computing, and a practical way to incorporate new insights, and changes (hopefully betterments) in your coding abilities.

In the fist version of the Jupyter notebook, submitted in 2017, all the calculations and all the plotting commands where packed inside a single monster function that was passed to ipywidgets.interact. Quite frankly, as time passed this approach seemed less and less Phytonic (aka mature) and no longer representative of my programming skills and style, and increased understanding of widget objects.

After a significant hiatus (2 years) I restructured the whole project in several ways:
– Converted Python 2 code to Python 3
– Created separate helper functions for each calculation and moved them to the top to improve on both clarity and reusability of the code.
– Improved and standardized function docstrings
– Optimized and reduced the number of parameters
– Switched from interact to interactive to enable access to the colormaparray in later cells (for printing, further plotting, and exporting).

The new notebook is here.

In a different chapter in the book I talk at more length about the need to Keep on improving your geocomputing projects.