Geoscience Machine Learning bits and bobs – data inspection

If you have not read Geoscience Machine Learning bits and bobs – introduction, please do so first, as there I go through the objective and outline of this series, as well as a review of the dataset I will be using, which is from the 2016 SEG Machine Learning contest.

*** September 2020 UPDATE ***

Although I have more limited time now, compared to 2016,  I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.


OK, let’s begin!

With each post, I will add a new notebook to the GitHub repo here. The notebook that goes with this post is called 01 – Data inspection.

Data inspection

The first step after loading the dataset is to create a Pandas DataFrame. With the describe method I get a lot of information for free:

Indeed, from the first row in the summary I learn that about 20% of samples in the photoelectric effect column PE are missing.
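As a minimal sketch of how the count row of describe translates into a missing fraction, the snippet below uses a small made-up DataFrame (the values are hypothetical, not the contest data) and compares the count of non-null samples with the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the contest logs
toy = pd.DataFrame({
    "GR": [77.5, 78.3, 79.1, 86.1, 74.6],
    "PE": [4.6, np.nan, 3.6, 3.5, 3.4],
})

summary = toy.describe()
# The 'count' row holds the number of non-null samples in each column
missing_fraction = 1 - summary.loc["count"] / len(toy)
print(missing_fraction)
```

In this toy case one PE sample out of five is null, so the PE entry of missing_fraction is 0.2, while GR has no missing samples.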

I can use pandas.isnull to tell me, for each well, whether any column has null values, and sum to get the number of null values, again for each column.

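# Loop over the wells and report null values for each one
for well in training_data['Well Name'].unique():
    w = training_data.loc[training_data['Well Name'] == well]
    print(w.isnull().values.any())  # True if the well has any null value
    print(w.isnull().sum(), '\n')   # number of null values in each column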

Simple and quick, the output tells me, for example, that the well ALEXANDER D is missing 466 PE samples, and Recruit F9 is missing 12.

However, the printout is neither easy nor pleasant to read, as it is a long list like this:

Facies       0
Formation    0
Well Name    0
Depth        0
GR           0
ILD_log10    0
DeltaPHI     0
PHIND        0
PE           0
NM_M         0
RELPOS       0
dtype: int64 

Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64 

Recruit F9
Facies        0
Formation     0
Well Name     0
Depth         0
GR            0
ILD_log10     0
DeltaPHI      0
PHIND         0
PE           12
NM_M          0
RELPOS        0
dtype: int64
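A more compact alternative is a single table with one row per well and one column per log. The sketch below is only an illustration on a made-up DataFrame (well names and values are hypothetical); with the real data, training_data would take the place of toy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two wells, two logs
toy = pd.DataFrame({
    "Well Name": ["A", "A", "A", "B", "B"],
    "GR": [77.5, 78.3, 79.1, 86.1, 74.6],
    "PE": [4.6, np.nan, np.nan, 3.5, 3.4],
})

# Flag nulls in the log columns, then count them within each well
null_counts = (
    toy.drop(columns="Well Name")
       .isnull()
       .groupby(toy["Well Name"])
       .sum()
)
print(null_counts)
```

Each cell of null_counts is the number of null samples for that well and log, so wells with missing PE stand out at a glance.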


From those I can see that, apart from the issues with the PE log, GR has some high values in SHRIMPLIN, and so on…

All of the above is critical to determine the data imputation strategy, which is the topic of one of the next posts; but first, in the next post, I will use a number of visualizations of the data to examine its distribution by well and by facies, and to explore relationships among variables.

Geoscience Machine Learning bits and bobs – introduction

Bits and what?

After whetting (hopefully) your appetite with the Machine Learning quiz / teaser I am now moving on to a series of posts that I decided to title “Geoscience Machine Learning bits and bobs”.

OK, BUT first of all, what does ‘bits and bobs‘ mean? It is a (mostly) British English expression that means “a lot of small things”.

Is it a commonly used expression? If you are curious enough you can read this post about it on the Not one-off British-isms blog. Or you can just look at the two Google Ngram plots below: the first is my updated version of the one in the post, comparing the usage of the expression in British vs. US English; the second compares its usage in British English to that of the more familiar “bits and pieces” (not exactly the same according to the author of the blog, but the Cambridge Dictionary seems to contradict the claim).

I’ve chosen this title because I wanted to distill, in one spot, some of the best collective bits of Machine Learning that came out during, and in the wake of, the 2016 SEG Machine Learning contest, including:

  • The best methods and insights from the submissions, particularly the top 4 teams
  • Things that I learned myself, during and after the contest
  • Things that I learned from blog posts and papers published after the contest

I will touch on a lot of topics but I hope that – in spite of the title’s pointing to a random assortment of things –  what I will have created in the end is a cohesive blog narrative and a complete, mature Machine Learning pipeline in a Python notebook.

*** September 2020 UPDATE ***

Although I have more limited time these days, compared to 2016,  I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.


Some background on the 2016 ML contest

The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data from the original notebook by Brendon Hall:

The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

The nine discrete facies (classes of rocks) are:

Tentative topics for this series

  • List of previous works (in this post)
  • Data inspection
  • Data visualization
  • Data sufficiency
  • Data imputation
  • Feature augmentation
  • Model training and evaluation
  • Connecting the bits: a full pipeline

List of previous works (comprehensive, to the best of my knowledge)

In each post I will make a point to explicitly reference whether a particular bit (or a bob) comes from a submitted notebook by a team, a previously unpublished notebook of mine, a blog post, or a paper.

However, I’ve also compiled below a list of all the published works, for those that may be interested.

The contest’s original article published by Brendon Hall on The Leading Edge, and the accompanying notebook

The Github repo with all teams’ submissions.

Two blog posts by Matt Hall of Agile Scientific, here and here

The published summary of the contest by Brendon Hall and Matt Hall on The Leading Edge

An SEG extended abstract on using gradient boosting on the contest dataset

An arXiv e-print paper on using a ConvNet on the contest dataset

Abstract for a talk at the 2019 CSEG / CSPG Calgary Geoconvention


How to fix rainbows and other bad colormaps using Python

Yep, colormaps again!

In my 2014 tutorial on The Leading Edge I showed how to Evaluate and compare colormaps (Jupyter notebook here). The article followed an extended series of posts (The rainbow is dead…long live the rainbow!) and then some more articles on rainbow-like colormap artifacts (for example here and here).

Last year, in a post titled Unweaving the rainbow, Matt Hall described our joint attempt to make a Python tool for recovering digital data from scientific images (and seismic sections in particular), without any prior knowledge of the colormap. Please check our GitHub repository for the code and slides, and watch Matt’s talk (very insightful and very entertaining) from the 2017 Calgary Geoconvention below:

One way to use the app is to get an image with unknown, possibly awful colormap, get the data, and re-plot it with a good one.

Matt followed up on colormaps with a more recent post titled No more rainbows! where he relentlessly demonstrates the superiority of perceptual colormaps for subsurface data. Check his wonderful Jupyter notebook.

So it might come as a surprise to some, but this post is a lifesaver for those that really do like rainbow-like colormaps. I discuss a Python method to equalize colormaps so as to render them perceptual.  The method is based in part on ideas from Peter Kovesi’s must-read paper – Good Colour Maps: How to Design Them – and the Matlab function equalisecolormap, and in part on ideas from some old experiments of mine, described here, and a Matlab prototype code (more details in the notebook for this post).

Let’s get started. Below is a time structure map for a horizon in the Penobscot 3D survey (offshore Nova Scotia, licensed CC-BY-SA by dGB Earth Sciences and The Government of Nova Scotia). Can you clearly identify the discontinuities in the southern portion of the map? No?


OK, let me help you. Below I am showing the map resulting from running a Sobel filter on the horizon.

This is much better, right? But the truth is that the discontinuities are right there in the original data; some, however, are very hard to see because of the colormap used (nipy spectral, one of the many Matplotlib cmaps),  which introduces perceptual artifacts, most notably in the green-to-cyan portion.

In the figure below, in the first panel (from the top) I show a plot of the colormap’s Lightness value (obtained converting a 256-sample nipy spectral colormap from RGB to Lab) for each sample; the line is coloured by the original RGB colour. This erratic Lightness profile highlights the issue with this colormap: the curve gradient changes magnitude several times, indicating a nonuniform perceptual distance between samples.

In the second panel, I show a plot of the cumulative sample-to-sample Lightness contrast differences, again coloured by the original RGB colours in the colormap. This is the best plot to look at because flat spots in the cumulative curve correspond to perceptual flat spots in the map, which is where the discontinuities become hard to see. Notice how the green-to-cyan portion of this curve is virtually horizontal!

That’s it, it is simply a matter of very low, artificially induced perceptual contrast.
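The two curves described above can be sketched with NumPy alone. The conversion below is the standard sRGB to CIE L* formula (D65 white), which is assumed here to be close enough to the RGB-to-Lab conversion used for the figure; the function names are illustrative and not taken from the notebook.

```python
import numpy as np

def srgb_to_lightness(rgb):
    """CIE L* (0 to 100) for an (N, 3) array of sRGB values in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma to get linear-light values
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Relative luminance Y (the D65 white point has Y = 1)
    y = linear @ np.array([0.2126, 0.7152, 0.0722])
    delta = 6 / 29
    f = np.where(y > delta ** 3, np.cbrt(y), y / (3 * delta ** 2) + 4 / 29)
    return 116 * f - 16

def cumulative_contrast(lightness):
    """Cumulative sum of absolute sample-to-sample Lightness differences."""
    steps = np.abs(np.diff(lightness))
    return np.concatenate(([0.0], np.cumsum(steps)))

# Hypothetical 4-sample gray colormap with a repeated (perceptually flat) sample
toy_rgb = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]])
toy_lightness = srgb_to_lightness(toy_rgb)
toy_cumulative = cumulative_contrast(toy_lightness)
```

For a Matplotlib colormap, the RGB samples can be obtained with something like plt.get_cmap("nipy_spectral")(np.linspace(0, 1, 256))[:, :3]. The repeated sample in the toy example produces a horizontal segment in toy_cumulative, which is the kind of flat spot discussed above.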

Solutions to this problem: the obvious one is to NOT use this type of colormap (you can learn much about which are good perceptually, and which are not, here); a possible alternative is to fix them. This can be done by re-sampling the cumulative curve so as to give it constant slope (or constant perceptual contrast). The irregularly spaced dots at the bottom (in the same second panel) show the re-sampling locations, which are much farther apart in the perceptually flat areas and much closer in the more dipping areas.

The third panel shows the resulting constant (and regularly sampled) cumulative Lightness contrast differences, and the fourth and last the final Lightness profile, which is now composed of segments with equal Lightness gradient (in absolute value).
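The re-sampling idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in the notebook and not Peter Kovesi's function: it takes the Lightness profile as an input, interpolates linearly in RGB, and uses a hypothetical function name.

```python
import numpy as np

def equalize_colormap(rgb, lightness, n_out=256):
    """Resample an (N, 3) colormap so cumulative |dL| grows at a constant rate."""
    rgb = np.asarray(rgb, dtype=float)
    lightness = np.asarray(lightness, dtype=float)
    index = np.arange(len(rgb), dtype=float)
    # Cumulative sample-to-sample Lightness contrast, starting at zero
    cumulative = np.concatenate(([0.0], np.cumsum(np.abs(np.diff(lightness)))))
    # A tiny ramp keeps the curve strictly increasing across perfectly flat spots
    cumulative = cumulative + index * 1e-9
    # Equal steps along the contrast axis ...
    targets = np.linspace(0.0, cumulative[-1], n_out)
    # ... mapped back to (fractional) positions along the original colormap
    positions = np.interp(targets, cumulative, index)
    # Linearly interpolate each RGB channel at the new positions
    new_rgb = np.column_stack(
        [np.interp(positions, index, rgb[:, k]) for k in range(3)]
    )
    return new_rgb, positions

# Hypothetical example: a gray ramp whose Lightness is flat at the start
t = np.linspace(0.0, 1.0, 64)
toy_lightness = 100.0 * t ** 2
toy_rgb = np.column_stack([t, t, t])
new_rgb, positions = equalize_colormap(toy_rgb, toy_lightness, n_out=32)
```

The returned positions play the role of the irregularly spaced dots in the second panel: they are far apart where the cumulative curve is flat and close together where it is steep. Because the interpolation is done in RGB, the Lightness between the original samples is only approximately linear.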

Here is the structure map for the Penobscot horizon using the nipy spectral colormap before and after equalization, on top of each other to facilitate comparison. I think this method works rather well, and it will allow hard core aficionados to keep using their favourite rainbow and rainbow-like colormaps.



If you want the code to try the equalization, get the notebook on GitHub.

Adding normal and reverse faults to an impedance model with Python

This is a short and sweet follow-up to yesterday’s post Making many synthetic seismic models in Python, in which I want to show how to add oblique faults to an impedance model, as opposed to a vertical one.

In the code snippet below, after the imports, I first select one of the impedance models from the many I created (impedances and n come from the previous post), then make two new versions, one shifted up 5 samples, one shifted down 5 samples. The impedance values for the extra volume added – at the bottom in the first case, the top in the second – are derived from the unshifted impedance model.

Next, I make two copies of the original, call them normal and reverse, and then replace the upper triangle with the upper triangle of the shifted down and shifted up versions, respectively.

import numpy as np
from scipy import ndimage

unshifted = impedances[n]  # pick one of the impedance models
# A positive shift moves samples toward higher row indices; vacated rows are filled with cval
up = ndimage.shift(unshifted, (5, 0), cval=unshifted[0, 0] * 0.9)
down = ndimage.shift(unshifted, (-5, 0), cval=unshifted[0, -1] * 0.8)
normal = unshifted.copy()
reverse = unshifted.copy()
sz = len(unshifted) - 1
tri = np.triu_indices(sz)  # upper triangle of the leading sz-by-sz block
normal[tri] = up[tri]
reverse[tri] = down[tri]

The results are shown in the figure below.

Left: unfaulted impedance model; center, model with normal fault; right, model with reverse fault.
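To try the idea without the models from the previous post, here is a self-contained sketch on a made-up layered model. The model, its size, and the choice of order=0 (nearest-neighbour shifting, so that values are copied exactly) are assumptions made for this illustration only.

```python
import numpy as np
from scipy import ndimage

# Hypothetical layered model: 50 x 50 samples, impedance increasing with row index
model = np.repeat(np.linspace(2000.0, 6000.0, 50)[:, None], 50, axis=1)

# Move every sample 5 rows toward higher indices; vacated rows take the value cval
shifted = ndimage.shift(model, (5, 0), order=0, cval=model[0, 0])

# Replace the upper triangle of a copy with the shifted version to create the fault
faulted = model.copy()
tri = np.triu_indices(len(model) - 1)
faulted[tri] = shifted[tri]
```

Everything strictly below the diagonal keeps its original value, while samples on and above it come from the shifted copy, which produces an oblique discontinuity along the diagonal.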

Computer vision in geoscience: recover seismic data from images, introduction

In a recent post titled Unweaving the rainbow, Matt Hall described our joint attempt (partly successful) to create a Python tool to enable recovery of digital data from any pseudo-colour scientific image (and a seismic section in particular, like the one in Figure 1), without any prior knowledge of the colormap.

Seismic picture on wall

Figure 1. Test image: a photo of a distorted seismic section on my wall.

Please check our GitHub repository for the code and slides and watch Matt’s talk (very insightful and very entertaining) from the 2017 Calgary Geoconvention below:

In the next two posts, coming up shortly, I will describe in greater detail my contribution to the project, which focused on developing a computer vision pipeline to automatically detect where the seismic section is located in the image, rectify any distortions that might be present, and remove all sorts of annotations and trivia around and inside the section. The full workflow is included below (with sections I-VI developed to date):

  • I – Image preparation, enhancement:
    1. Convert to gray scale
    2. Optional: smooth or blur to remove high frequency noise
    3. Enhance contrast
  • II – Find seismic section:
    1. Convert to binary with adaptive or other threshold method
    2. Find and retain only largest object in binary image
    3. Fill its holes
    4. Apply opening and dilation to remove minutiae (tick marks and labels)
  • III – Define rectification transformation
    1. Detect contour of largest object found in (2). This should be the seismic section.
    2. Approximate contour with polygon with enough tolerance to ensure it has 4 sides only
    3. Sort polygon corners using angle from centroid
    4. Define new rectangular image using length of largest long and largest short sides of initial contour
    5. Estimate and output transformation to warp polygon to rectangle
  • IV – Warp using transformation
  • V – Blanking annotations inside seismic section (if rectangular):
    1. Start with output of (4)
    2. Pre-process and apply Canny filter
    3. Find contours in the Canny filter output that are smaller than the input size
    4. Sort contours (by shape and angular relationships or diagonal lengths)
    5. Loop over contours:
      1. Approximate contour
      2. If approximation has 4 points AND the 4 semi-diagonals are of same length: fill contour and add to mask
  • VI – Use mask to remove text inside rectangle in the input and blank (NaN) the whole rectangle. 
  • VII – Optional: tools to remove arrows and circles/ellipses:
    1. For arrows – among the contours from (4), find the ones with 7 sides and low convexity (concave); alternatively, use Harris corner detection and count 7 corners, or template matching
    2. For ellipses – template matching or regionprops
  • VIII – Optional FFT filters to remove timing lines and vertical lines
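As an illustration of one step in the list above (III.3, sorting the polygon corners using the angle from the centroid), here is a minimal NumPy sketch. The function name and the test rectangle are hypothetical, and this is not necessarily how the module on GitHub implements the step.

```python
import numpy as np

def sort_corners(points):
    """Order 2D points by their angle around the centroid."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # Angle of each corner as seen from the centroid, in (-pi, pi]
    angles = np.arctan2(pts[:, 1] - centroid[1], pts[:, 0] - centroid[0])
    return pts[np.argsort(angles)]

# Hypothetical corners of a detected quadrilateral, as (x, y), in arbitrary order
corners = [[10, 0], [0, 0], [10, 5], [0, 5]]
ordered = sort_corners(corners)
```

With image coordinates (x to the right, y down), the ordered corners of this rectangle run top-left, top-right, bottom-right, bottom-left, which gives a consistent order to pair with the corners of the target rectangle when estimating the warp.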

You can download from GitHub all the tools for the automated workflow (parts I-VI) in the module, as well as an example Jupyter Notebook showing how to run it.

The first post focuses on the image pre-processing and enhancement, and the detection of the seismic line (sections I and II, in green); the second one deals with the rectification of the seismic (sections IV to V, in blue). They are not meant as full tutorials, rather as a pictorial road map to (partial) success, but key Python code snippets will be included and discussed.

A study of Ricker wavelets in MS Excel

There, I said it, and aloud: ‘EXCEL’!

I got this idea of making a modern (no, I am NOT kidding) educational tool to interactively construct and study Ricker wavelets after reading William Ashcroft’s A Petroleum Geologist’s Guide to Seismic Reflection. The book is a gem, in my opinion (I discovered it thanks to a blog post by Matt Hall a couple of years ago) and comes with some interesting tools on a DVD, which are unfortunately a bit outdated. So I went back to my old Geophysics Bible (Yilmaz’ book Seismic Data Analysis), and mashed a few ideas up; you can Download the Excel file here. Please feel free to use it, peruse it, and abuse it, in your classes, labs, nightmares, etcetera…

The tool is fairly simple. In Sheet 1 the user enters the dominant frequency of the desired Ricker wavelet, as shown in the middle of Figure 1. From that information the wavelet is constructed using the equation A = g^2 * 1/exp(g^2), where g is the ratio between the frequency f (in increments of 5 Hz up to an arbitrary 125 Hz – but this could be easily changed!) and the dominant frequency f1 just entered. The frequency spectrum of the wavelet is shown as a graph.

sheet 1 input and amplitude spectrum 30 hz

Figure 1

In Figure 2 I show the same Sheet, but for a wavelet of dominant frequency equal to 50 Hz.

sheet 1 input and amplitude spectrum 50 hz

Figure 2

A number of ancillary graphs, shown in Figure 3, display individual building blocks of the formula A = g^2 * 1/exp(g^2).

Figure 3
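For readers who prefer code to spreadsheets, the same amplitude spectrum and its building blocks can be sketched in a few lines of NumPy. The function name is illustrative, and starting the frequency axis at 0 Hz is an assumption; the 5 Hz increment and the 125 Hz limit are those described above.

```python
import numpy as np

def ricker_spectrum(f, f_dom):
    """Amplitude A = g^2 * 1/exp(g^2), with g = f / f_dom."""
    g = np.asarray(f, dtype=float) / f_dom
    g_squared = g ** 2                   # first building block
    inverse_exp = 1 / np.exp(g ** 2)     # second building block
    return g_squared * inverse_exp

freqs = np.arange(0, 130, 5)             # 0 to 125 Hz in 5 Hz increments
amp_30 = ricker_spectrum(freqs, 30.0)    # dominant frequency of 30 Hz
amp_50 = ricker_spectrum(freqs, 50.0)    # dominant frequency of 50 Hz
```

The spectrum peaks where g = 1, that is at the dominant frequency, with a maximum amplitude of 1/e.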

In Sheet 2 the user can view a plot of the wavelet in the time domain, add constant phase shifts and linear phase shifts, and experiment with different combinations of them, as shown in Figures 4-7, below.

sheet 2 zero phase

Figure 4

sheet 2 linear phase shift 180

Figure 5

sheet 2 constant phase shift 90

Figure 6

sheet 2 constant 90 and linear 180 phase shifts

Figure 7
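The time-domain wavelet and its phase-shifted versions can be approximated by summing cosine components weighted by the amplitude spectrum. The sketch below is an assumption-laden illustration rather than a transcription of the Excel file: the function name is hypothetical, and the linear phase shift is scaled so that the stated angle applies at the dominant frequency, which may differ from the convention used in the spreadsheet.

```python
import numpy as np

def wavelet_from_components(t, freqs, f_dom, const_phase_deg=0.0, linear_phase_deg=0.0):
    """Sum cosine components weighted by A = g^2 * 1/exp(g^2)."""
    g = freqs / f_dom
    amp = g ** 2 * np.exp(-g ** 2)
    # Constant shift: the same angle for every component.
    # Linear shift: an angle proportional to frequency (a pure time shift).
    phase = np.deg2rad(const_phase_deg + linear_phase_deg * g)
    components = amp[:, None] * np.cos(
        2 * np.pi * freqs[:, None] * t[None, :] + phase[:, None]
    )
    return components.sum(axis=0), components

t = np.linspace(-0.1, 0.1, 201)          # time in seconds, centred on zero
freqs = np.arange(5.0, 130.0, 5.0)       # 5 to 125 Hz in 5 Hz increments
zero_phase, parts = wavelet_from_components(t, freqs, 30.0)
rotated_90, _ = wavelet_from_components(t, freqs, 30.0, const_phase_deg=90.0)
```

The zero-phase result is symmetric with its maximum at time zero, while the 90 degree constant shift gives an antisymmetric wavelet; the rows of parts are the individual sinusoids that add up to the wavelet.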

Finally, Sheet 3 displays a plot of the wavelet, and of the individual frequency components that make it up. Have fun!!!

sheet 3 component sinusoids

Figure 8

References and further playing

A Petroleum Geologist’s Guide to Seismic Reflection, William Ashcroft, 2011, Wiley.

Seismic Data Analysis, Öz Yilmaz, 2001, Society of Exploration Geophysicists.

To plot a wavelet [in Python], blog post by Evan Bianco.

Mapping and validating geophysical lineaments with Python

In Visualization tips for geoscientists: MATLAB, Part III I showed there’s a qualitative correlation between occurrences of antimony mineralization in the southern Tuscany mining district and the distance from lineaments derived from the total horizontal derivative (also called maximum horizontal gradient).

Let’s take a look at it below (distance from lineaments increases as the color goes from blue to green, yellow, and then red).


However, in a different map in the same post I showed that lineaments derived using the maxima of the hyperbolic tilt angle (Cooper and Cowan, 2006, Enhancing potential field data using filters based on the local phase) are offset systematically from those derived using the total horizontal derivative.
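For reference, the total horizontal derivative of a gridded field is the magnitude of its horizontal gradient, the square root of (dg/dx)^2 + (dg/dy)^2. A minimal NumPy sketch follows; the function name, the grid spacing arguments, and the test surface are assumptions for illustration.

```python
import numpy as np

def total_horizontal_derivative(grid, dx=1.0, dy=1.0):
    """Magnitude of the horizontal gradient of a 2D grid."""
    # np.gradient returns the derivative along axis 0 (rows, y) first,
    # then along axis 1 (columns, x)
    d_dy, d_dx = np.gradient(np.asarray(grid, dtype=float), dy, dx)
    return np.hypot(d_dx, d_dy)

# Hypothetical test surface: a plane with slopes 3 and 4 along x and y
y, x = np.mgrid[0:20, 0:30]
plane = 3.0 * x + 4.0 * y
thd = total_horizontal_derivative(plane)
```

On real gravity data the maxima of this quantity are what get traced as lineaments; on the test plane it is constant and equal to 5.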

Let’s take a look at this second map below: in this case Bouguer gravity values increase as the color goes from blue to green, yellow, and then red; white polygons are basement outcrops.

The lineaments from the total horizontal derivative are in black, those from the maxima of hyperbolic tilt angle are in gray. Which lineaments should be used?

The ideal way to map the location of density contrast edges (as a proxy for geological contacts) would be to use gravity forward models, or even 3D gravity inversion, ideally constrained by all available independent data sources (magnetic or induced-polarization profiles, exploratory drilling data, reflection seismic interpretations, and so on).

The next best approach is to map edges using a number of independent gravity data enhancements, and then only use those that collocate.

Cooper and Cowan (same 2006 paper) demonstrate that no single edge-detection method is a perfect geologic-contact mapper. Citing Pilkington and Keating (2004, Contact mapping from gridded magnetic data – A comparison of techniques) they conclude that the best approach is to use “collocated solutions from different methods providing increased confidence in the reliability of a given contact location”.

I show an example of such a workflow in the image below. In the first column from the left is a map of the residual Bouguer gravity from a smaller area of interest in the southern Tuscany mining district (where measurements were made on a denser station grid). In the second column from the left are the lineaments extracted using three different (and independent) derivative-based data enhancements followed by skeletonization. The same lineaments are superimposed on the original data in the third column from the left. Finally, in the last column, the lineaments are combined into a single collocation map to increase confidence in the edge locations (I applied a mask so as to display edges only where at least two methods collocate).
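The last step, combining the lineaments and masking, can be sketched as a simple vote count over binary edge maps. This assumes the skeletonized lineaments are already available as boolean arrays on the same grid and that collocation means exact pixel overlap; the function name and the tiny example arrays are hypothetical.

```python
import numpy as np

def collocation_map(edge_maps, min_votes=2):
    """Count how many methods flag each cell; keep cells with enough votes."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in edge_maps])
    votes = stack.sum(axis=0)
    # Zero out cells where fewer than min_votes methods agree
    return np.where(votes >= min_votes, votes, 0)

# Hypothetical lineaments from three enhancements on a 3 x 3 grid
edges_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
edges_b = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
edges_c = [[1, 1, 0], [0, 0, 0], [0, 0, 0]]
collocated = collocation_map([edges_a, edges_b, edges_c])
```

In practice the lineaments from different methods rarely fall on exactly the same pixels, so some tolerance (for example, thickening each edge map slightly before counting) may be needed.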


If you want to learn more about this method, please read my note in the Geophysical tutorial column of The Leading Edge, which is available with open access here.
To run the open source Python code, download the iPython/Jupyter Notebook from GitHub.

With this notebook you will be able to:

1) create a full suite of derivative-based enhanced gravity maps;

2) extract and refine lineaments to map edges;

3) create a collocation map.

These techniques can be easily adapted to collocate lineaments derived from seismic data, with which the same derivative-based enhancements are showing promising results (Russell and Ribordy, 2014, New edge detection methods for seismic interpretation).

Moiré Patterns

Moiré pattern

Some time ago I reblogged a post from El Ojo Inoportuno showing a Moiré pattern, which resulted from taking a photo of a circular pattern of (beautiful) tiles. This phenomenon is caused by undersampling and is also called space aliasing. There’s a very good explanation of space aliasing and another stunning Moiré example on Agile Geoscience’s post N is for Nyquist.
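A one-dimensional example shows what undersampling does. In the sketch below (the numbers are made up for illustration), a sinusoid with 9 cycles per unit distance is sampled only 10 times per unit distance, so its samples are indistinguishable from those of a much slower sinusoid.

```python
import numpy as np

sample_rate = 10.0                       # samples per unit distance (Nyquist = 5)
x = np.arange(20) / sample_rate          # sample positions
true_freq = 9.0                          # cycles per unit distance, above Nyquist

samples = np.sin(2 * np.pi * true_freq * x)
# The same samples are produced by a 1 cycle per unit sinusoid of opposite sign
alias = -np.sin(2 * np.pi * 1.0 * x)

print(np.allclose(samples, alias))
```

The apparent frequency is the difference between the sampling rate and the true frequency (10 - 9 = 1), which is why finely spaced patterns turn into broad, slowly varying fringes.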

Creating Moiré patterns

One way to get a Moiré pattern is to superimpose two identical, transparent line gratings and rotate one by an angle. You can see an animation of this on Wolfram MathWorld here; notice that the pattern varies with the angle. On the same page there’s also an example of Moiré patterns generated by plotting series of curves on a computer screen, which is very similar to taking the photo of circular tiles shown in the Ojo Inoportuno photo. Again the interference is caused by representing circles with a finite size pixel grid. If you are interested you can experiment with these effects and many more by downloading templates from this site. Figure 1 shows my own Moiré from circular patterns.


Figure 1


There is a program for interactive Moiré pattern experiments called iMoiré.
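The superimposed gratings can also be reproduced with a few lines of NumPy. The grid size, line period, and rotation angle below are arbitrary choices, and the function name is hypothetical.

```python
import numpy as np

def line_grating(shape, period, angle_deg=0.0):
    """Binary grating of parallel lines: 1 where clear, 0 where opaque."""
    rows, cols = np.indices(shape)
    theta = np.deg2rad(angle_deg)
    # Coordinate measured perpendicular to the lines
    u = cols * np.cos(theta) + rows * np.sin(theta)
    return (np.sin(2 * np.pi * u / period) > 0).astype(float)

a = line_grating((400, 400), period=8)
b = line_grating((400, 400), period=8, angle_deg=5)
# Overlaying two transparencies: light passes only where both are clear
moire = a * b
```

Displaying moire with an image viewer (for example Matplotlib's imshow) shows broad fringes roughly perpendicular to the lines. For two identical gratings with period p rotated by an angle alpha, the fringe spacing is p / (2 sin(alpha / 2)), about 92 pixels for the values used here, so the fringes widen as the angle shrinks.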

Another way to get a Moiré pattern is to scan a picture printed with halftone. There’s a simple explanation of this scanning-generated interference here. Again this is a matter of aliasing, or undersampling. Here’s a good example:

Figure 2

The original image is a lovely watercolor by Ettore Roesler Franz showing medieval houses along the Tiber river in Rome. The Moiré pattern results from scanning the watercolor from one of the book collections (the image was posted on Flickr here).

How to remove Moiré pattern from digital images

For a quick solution, there’s a good article with detailed instructions on how to remove Moiré patterns in Photoshop, Paint Shop Pro, etcetera. For a more advanced workflow there’s an excellent top hat filter for Photoshop included in Reindeer Graphics’ FoveaPro plugin. In Figure 3, I created a sort of pictorial chart of this workflow using low resolution copies of examples in The Image Processing Cookbook, by John C. Russ.


Figure 3


In future posts I plan to show how to remove Moiré patterns from images with open source code using Python, and then to extend the workflow to the removal (or attenuation) of acquisition footprint in seismic data, which has a very similar appearance in the 2D Fourier domain, and can be filtered with very similar techniques.
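As a taste of the Fourier-domain idea, here is a bare-bones notch filter on a synthetic image. It is a generic sketch under simplifying assumptions (the periodic pattern falls exactly on one frequency bin, and its location is known), not the top hat workflow mentioned above; all names and values are hypothetical.

```python
import numpy as np

def notch_filter(image, peaks):
    """Zero the listed (row, column) frequency bins and their symmetric partners."""
    spectrum = np.fft.fft2(image)
    n_rows, n_cols = image.shape
    for r, c in peaks:
        spectrum[r % n_rows, c % n_cols] = 0
        # A real image has a conjugate-symmetric spectrum, so remove the mirror bin too
        spectrum[-r % n_rows, -c % n_cols] = 0
    return np.fft.ifft2(spectrum).real

rows, cols = np.indices((64, 64))
background = 1.0 + 0.2 * np.cos(2 * np.pi * rows / 64)    # smooth "signal"
pattern = 0.5 * np.cos(2 * np.pi * 8 * cols / 64)         # periodic interference
filtered = notch_filter(background + pattern, [(0, 8)])
```

In this idealized case the filtered image matches the background exactly. Real Moiré patterns and acquisition footprint spread over several neighbouring bins, so the notches need to be wider and are usually tapered to limit ringing.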