Geoscience Machine Learning bits and bobs – data completeness

2016 Machine learning contest – Society of Exploration Geophysicists

In a previous post I showed how to use pandas.isnull to find out, for each well individually, whether a column has any null values, and sum to get how many there are, for each column. Here is one of the examples (with more modern, idiomatic pandas syntax compared to the example in the previous post):

for well, data in training_data.groupby('Well Name'):
    print(well)
    print(data.isnull().values.any())
    print(data.isnull().sum(), '\n')

Simple and quick, the output showed me that – for example – the well ALEXANDER D is missing 466 samples from the PE log:

ALEXANDER D
True
Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64

A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function from the missingno library. With the code below I can turn each well into a Pandas DataFrame on the fly, then make a missingno matrix plot.

import missingno as msno  # imports assumed; not shown in the original snippet
import matplotlib.pyplot as plt
import numpy as np

for well, data in training_data.groupby('Well Name'):
    msno.matrix(data, color=(0., 0., 0.45))
    fig = plt.gcf()
    fig.set_size_inches(20, np.round(len(data)/100))  # height of the plot for each well reflects well length
    axes = fig.get_axes()
    axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')

I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they could be used to guide the strategy for dealing with missing data, together with domain knowledge from petrophysics.

Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thought is that it may pay to predict missing values sequentially rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from the top right, I am thinking of first testing the use of GR to predict missing values in RDEP, then both to predict missing values in RMED, then DTC. Then add CALI and use all logs completed so far to predict RHOB, and so on.
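To make the idea concrete, below is a minimal sketch of that sequential strategy, assuming a DataFrame logs with the column names mentioned above; the Random Forest regressor is an illustrative choice, not a recommendation:

from sklearn.ensemble import RandomForestRegressor

def fill_sequentially(logs, seed=('GR',), order=('RDEP', 'RMED', 'DTC', 'CALI', 'RHOB')):
    """Complete one log at a time; each finished log joins the predictors."""
    logs = logs.copy()
    predictors = list(seed)
    for target in order:
        has_predictors = logs[predictors].notnull().all(axis=1)
        to_fill = logs[target].isnull() & has_predictors
        if to_fill.any():
            train = logs.loc[logs[target].notnull() & has_predictors]
            model = RandomForestRegressor(n_estimators=100, random_state=0)
            model.fit(train[predictors], train[target])
            logs.loc[to_fill, target] = model.predict(logs.loc[to_fill, predictors])
        # a log with no missing values (CALI, in the example above) simply joins the predictors
        predictors.append(target)
    return logs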

Naturally, this strategy will need to be tested against alternatives using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and cross-validation error first using only non-NaN rows, then replacing all NaNs with the column mean, then comparing this sequential log-completing strategy with an all-at-once strategy.
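As a rough illustration of what one of those comparisons could look like, here is a sketch using scikit-learn's learning_curve, assuming X and y hold the (completed) features and the facies labels; the classifier is a placeholder choice:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# training and cross-validation accuracy at increasing training set sizes
train_sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training')
plt.plot(train_sizes, cv_scores.mean(axis=1), 'o-', label='cross-validation')
plt.xlabel('number of training samples')
plt.ylabel('accuracy')
plt.legend()
plt.show()

Repeating this for each imputation strategy and overlaying the curves would show whether the extra effort of sequential completion actually pays off.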

Geoscience Machine Learning bits and bobs – data inspection

If you have not read Geoscience Machine Learning bits and bobs – introduction, please do so first: in it I go through the objective and outline of this series, as well as a review of the dataset I will be using, which is from the 2016 SEG Machine Learning contest.

*** September 2020 UPDATE ***

Although I have more limited time now, compared to 2016, I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.

***************************

OK, let’s begin!

With each post, I will add a new notebook to the GitHub repo here. The notebook that goes with this post is called 01 – Data inspection.

Data inspection

The first step after loading the dataset is to create a Pandas DataFrame. With the describe method I get a lot of information for free:
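A minimal sketch of those two steps (the CSV file name is an assumption, not necessarily the contest's):

import pandas as pd

training_data = pd.read_csv('training_data.csv')

# count, mean, std, min, quartiles, and max for every numeric column;
# the count row immediately reveals columns with missing samples
training_data.describe()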

Indeed, from the first row (count) in the summary I learn that about 20% of the samples in the photoelectric effect column PE are missing.

I can use pandas.isnull to tell me, for each well, if a column has any null values, and sum to get the number of null values, again for each column.

for well in training_data['Well Name'].unique():
    print(well)
    w = training_data.loc[training_data['Well Name'] == well]
    print(w.isnull().values.any())
    print(w.isnull().sum(), '\n')

Simple and quick, the output tells me, for example, that the well ALEXANDER D is missing 466 PE samples, and Recruit F9 is missing 12.

However, the printout is neither easy nor pleasant to read, as it is a long list like this:

SHRIMPLIN
False
Facies       0
Formation    0
Well Name    0
Depth        0
GR           0
ILD_log10    0
DeltaPHI     0
PHIND        0
PE           0
NM_M         0
RELPOS       0
dtype: int64 

ALEXANDER D
True
Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64 

Recruit F9
True
Facies        0
Formation     0
Well Name     0
Depth         0
GR            0
ILD_log10     0
DeltaPHI      0
PHIND         0
PE           12
NM_M          0
RELPOS        0
dtype: int64
...
...


From those I can also see that, apart from the issues with the PE log, GR has some anomalously high values in SHRIMPLIN, and so on…
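One quick way to scan for that kind of outlier, as a sketch:

# maximum GR per well; unusually high values stand out immediately
print(training_data.groupby('Well Name')['GR'].max().sort_values(ascending=False))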

All of the above is critical in determining the data imputation strategy, which is the topic of one of the next posts; but first, in the next post, I will use a number of visualizations of the data to examine its distribution by well and by facies, and to explore relationships among the variables.
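As a small taste of those visualizations, here is a sketch of a by-well distribution plot with seaborn (the log and the plot type are illustrative choices):

import seaborn as sns
import matplotlib.pyplot as plt

# distribution of the GR log, one box per well
sns.boxplot(data=training_data, x='Well Name', y='GR')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()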

Geoscience Machine Learning bits and bobs – introduction

Bits and what?

After whetting (hopefully) your appetite with the Machine Learning quiz / teaser, I am now moving on to a series of posts that I decided to title “Geoscience Machine Learning bits and bobs”.

OK, BUT first of all, what does ‘bits and bobs’ mean? It is a (mostly) British English expression that means “a lot of small things”.

Is it a commonly used expression? If you are curious enough you can read this post about it on the Not one-off British-isms blog. Or you can just look at the two Google Ngram plots below: the first is my updated version of the one in the post, comparing the usage of the expression in British vs. US English; the second compares its British English usage to that of the more familiar “bits and pieces” (not exactly the same according to the author of the blog, though the Cambridge Dictionary seems to contradict that claim).

I’ve chosen this title because I wanted to distill, in one spot, some of the best collective bits of Machine Learning that came out during, and in the wake of, the 2016 SEG Machine Learning contest, including:

  • The best methods and insights from the submissions, particularly the top 4 teams
  • Things that I learned myself, during and after the contest
  • Things that I learned from blog posts and papers published after the contest

I will touch on a lot of topics but I hope that – in spite of the title’s pointing to a random assortment of things – what I will have created in the end is a cohesive blog narrative and a complete, mature Machine Learning pipeline in a Python notebook.

*** September 2020 UPDATE ***

Although I have more limited time these days, compared to 2016, I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.

***************************

Some background on the 2016 ML contest

The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data from the original notebook by Brendon Hall:

The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

  • GR: gamma ray
  • ILD_log10: logarithm (base 10) of deep resistivity
  • DeltaPHI: neutron-density porosity difference
  • PHIND: average of neutron and density porosity
  • PE: photoelectric effect
  • NM_M: nonmarine-marine indicator
  • RELPOS: relative position

The nine discrete facies (classes of rocks) are:

  • SS: nonmarine sandstone
  • CSiS: nonmarine coarse siltstone
  • FSiS: nonmarine fine siltstone
  • SiSh: marine siltstone and shale
  • MS: mudstone (limestone)
  • WS: wackestone (limestone)
  • D: dolomite
  • PS: packstone-grainstone (limestone)
  • BS: phylloid-algal bafflestone (limestone)

Tentative topics for this series

  • List of previous works (in this post)
  • Data inspection
  • Data visualization
  • Data sufficiency
  • Data imputation
  • Feature augmentation
  • Model training and evaluation
  • Connecting the bits: a full pipeline

List of previous works (comprehensive, to the best of my knowledge)

In each post I will make a point to explicitly reference whether a particular bit (or a bob) comes from a submitted notebook by a team, a previously unpublished notebook of mine, a blog post, or a paper.

However, I’ve also compiled below a list of all the published works, for those that may be interested.

  • The contest’s original article published by Brendon Hall on The Leading Edge, and the accompanying notebook
  • The GitHub repo with all teams’ submissions
  • Two blog posts by Matt Hall of Agile Scientific, here and here
  • The published summary of the contest by Brendon Hall and Matt Hall on The Leading Edge
  • An SEG extended abstract on using gradient boosting on the contest dataset
  • An arXiv e-print paper on using a ConvNet on the contest dataset
  • Abstract for a talk at the 2019 CSEG / CSPG Calgary Geoconvention


Machine Learning quiz – part 1 of 3

Introduction

I’ve been meaning to write about the 2016 SEG Machine Learning Contest for some time. I am thinking of a short and not very structured series (i.e. I’ll jump all over the place) of 2, possibly 3 posts (with the exclusion of this quiz). It will mostly be a revisiting – and extension – of some work that team MandMs (Mark Dahl and I) did but did not necessarily post. I will most certainly touch on cross-validation, learning curves, data imputation, and maybe a few other topics.

Background on the 2016 ML contest

The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data from the original notebook by Brendon Hall:

The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

  • GR: gamma ray
  • ILD_log10: logarithm (base 10) of deep resistivity
  • DeltaPHI: neutron-density porosity difference
  • PHIND: average of neutron and density porosity
  • PE: photoelectric effect
  • NM_M: nonmarine-marine indicator
  • RELPOS: relative position

The nine discrete facies (classes of rocks) are:

  • SS: nonmarine sandstone
  • CSiS: nonmarine coarse siltstone
  • FSiS: nonmarine fine siltstone
  • SiSh: marine siltstone and shale
  • MS: mudstone (limestone)
  • WS: wackestone (limestone)
  • D: dolomite
  • PS: packstone-grainstone (limestone)
  • BS: phylloid-algal bafflestone (limestone)

For some examples of the work done during the contest, you can take a look at the original notebook, at one of the submissions by my team, where we used Support Vector Classification to predict the facies, or at a submission by one of the top 4 teams, all of whom achieved the highest scores on the validation data with different combinations of Boosted Trees trained on augmented features alongside the original features.

QUIZ

Just before last Christmas, I ran a fun little experiment to resume work with this dataset. I decided to turn the outcome into a quiz.

Below I present the predicted rock facies from two distinct models, which I call A and B. Both were trained to predict the same labeled facies picked by the geologist, which are shown in the left column of each model panel (the two left columns are identical). The right column in each panel shows the predictions. Which predictions are “better”?

Please be warned: the question is a trick one. As you can see, I am gently leading you to make a visual, qualitative assessment of “better-ness”, while being absolutely vague about the models and giving no information about the training process. This is intentional, and – yes! – not very fair. But that is the whole point of this quiz, which is really a teaser for the series.

Machine Learning in Geoscience with Scikit-learn. Part 2: inferential statistics and domain knowledge to select features for oil prediction

In the first post of this series I showed how to use Pandas, Seaborn, and Matplotlib to:

  • load a dataset
  • test, clean up, and summarize the data
  • start looking for relationships between variables using scatterplots and correlation coefficients

In this second post, I will expand on the latter point by introducing some tests and visualizations that will help highlight the possible criteria for choosing some variables, and dropping others. All in Python.

I will use a different dataset than the one in the previous post. This one is from the paper “Many correlation coefficients, null hypotheses, and high value” (Lee Hunt, CSEG Recorder, December 2013).

The target to be predicted is oil production from a marine barrier sand. We have measured production (in tens of barrels per day) and 7 initially unknown predictors, at 21 wells.

Hang on tight, and read along, because it will be a wild ride!

I will show how to:

1) automatically flag linearly correlated predictors, so we can decide which might be dropped. In the example below (a matrix of pair-wise correlation coefficients between variables) we see that X2 and X7, the second and third best individual predictors of production (shown in the bottom row), are also highly correlated to X1, the best overall predictor.

2) automatically flag predictors that fail a critical r test

3) create a table to assess the probability that a certain correlation is spurious, in other words the probability of getting at least the correlation coefficient we got with our sample, or an even higher one, purely by chance (a sketch of all three steps follows below).
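Here is a minimal sketch of those three steps, assuming a DataFrame data with predictor columns X1–X7 and a Production column (the names, and the 95% confidence level, are illustrative assumptions):

import numpy as np
from scipy import stats

n = len(data)  # number of samples (21 wells in this dataset)

# 1) pair-wise correlation matrix, to flag linearly correlated predictors
print(data.corr().round(2))

# 2) critical r at the 95% confidence level with n - 2 degrees of freedom:
#    correlations weaker than this cannot be distinguished from zero
t_crit = stats.t.ppf(1 - 0.025, df=n - 2)
r_crit = t_crit / np.sqrt(t_crit**2 + (n - 2))

# 3) for each predictor, the probability that its correlation with
#    production is spurious (the p-value of a Pearson correlation test)
for col in data.columns.drop('Production'):
    r, p = stats.pearsonr(data[col], data['Production'])
    verdict = 'passes' if abs(r) >= r_crit else 'fails'
    print(f'{col}: r = {r:.2f}, p = {p:.3f}, {verdict} the critical r test')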

I will not recommend running these tests and applying the criteria blindly. Rather, I will suggest how to use them to learn more about the data and, in conjunction with domain knowledge about the problem at hand (in this case oil production), make more informed choices about which variables should, and which should not, be used.

And, of course, I will show how to make the prediction.

Have fun reading: get the Jupyter notebook on GitHub.

Geology in pictures – meanders and oxbow lakes

I love meandering rivers. This is a short post with some (I think) great images of meandering rivers and oxbow lakes.

In my previous post Google Earth and a 5 minute book review: Geology illustrated I reviewed one of my favorite coffee table books: Geology Illustrated by John S. Shelton.

The last view in that post was as close a perspective replica of Figure 137 in the book as I could get using Google Earth, showing the meander belt of the Animas River a few miles from Durango, Colorado.

What a great opportunity to create a short time lapse: repeat snapshots of the same landscape nearly 50 years apart. This is priceless: 50 years are in many cases next to nothing in geological terms, and yet there are already some significant differences between the meanders in the two images, which I have included together below.

Shelton_137_copyright

Meander belt and floodplain of the Animas River a few miles from Durango, Colorado. From Geology Illustrated, John S. Shelton, 1966. Copyright of the University of Washington Libraries, Special Collections (John Shelton Collection, Shelton 2679).

Meander belt and floodplain of the Animas River as above, as viewed on Google Earth in 2013.

For example, it looks like the meander cutoff in the lower left portion of the image had likely ‘just’ happened in the 60s, whereas by the time the imagery used by Google Earth was acquired, the remnant oxbow lake had become more clearly defined. Another oxbow lake in the center has nearly disappeared, perhaps in part due to human activity.

Below are two pictures of the Fraser River in the Robson Valley that I took in the summer of 2013 during a hike to McBride Peak. This is an awesome example of an oxbow lake. We can still see where the river used to run before the cutoff.

horseshoe_panorama

horseshoe_closeup

And this is a great illustration (from Infographic illustrations) showing how these oxbow lakes are created. I love the hands-on feel…

ox-bow-lake-infographic

The last is an image of a tight meander loop from a previous post: Some photos of Northern British Columbia wildlife and geology.

tight_loop


3D sketchfab model of the ternary system quartz – nepheline – kalsilite

Another old model brought back to life: the ternary system quartz – nepheline – kalsilite, also called petrogeny’s residua system, which is used to describe the composition of many cooled residual magmas and the genesis of very undersaturated alkaline rocks. Click on the image below to see the model in action.

Ternary system quartz - nepheline - kalsilite

More on this classification, used for cooled residual magmas, on Alessandro Da Mommio’s website, for example his page on trachyte.

Some photos of Northern British Columbia wildlife and geology

Introduction

Last week I went on a helicopter ride with Gerry, my father-in-law, to count Kokanee Salmon in Camp Creek near Valemount, BC. We were invited by Curtis Culp of Dunster, BC, who is in charge of this conservation effort run by BC Hydro. Here’s a picture of Gerry leaving the chopper (from Yellowhead Helicopters).

chopper

Kokanee salmon is a land-locked relative of Sockeye salmon. This means that they spend all their life in inland lakes, never seeing the ocean. For spawning they enter the inlet streams of the lake where they live. Camp Creek is a smaller tributary of the Canoe River, an inlet of Kinbasket Lake, where these Kokanee live.

The number of fish is estimated visually from the helicopter using a hand-held tally counter (every ~100-fish patch is a click). As a matter of fact, Curtis and Gerry counted fish, and I went along for the fun. Their estimates were really close, coming in at 15,000 and 15,400 in ~35 minutes over a ~15 km stretch of Camp Creek. I counted 15 eagles, between Bald and Golden, and took some photos. Here they are!

Wildlife and nature photos

The first two are photos looking straight down at Camp Creek. Believe it or not, there’s fish there. See the dark spots? Those are Kokanee Salmon. And the job was to count them, so I am glad I did not have to (although it was easier with the naked eye).

Kokanee4

Kokanee3

The next two are a couple of photos taken at ground level, courtesy of Curtis. Here the salmon are easy to see.

Kokanee ground

Kokanee ground1

The next two are also photos of the creek from the helicopter. There are fish in there, but I can only say that because I saw them; I can’t quite make them out in the photos. I love the shots though: the crystal clear water and the shadows.

Kokanee1

Kokanee2

The following two are photos with eagles. I could not believe how tiny they look, since even at this distance they seemed huge to the naked eye. There is a Bald eagle in the first photo (middle left); the other two (in the middle of the second photo) are too tiny to identify.

eagle

two_eagles

This last one is a photo of the trees, just looking down. I find it mesmerizing.

trees

Upon looking at all the photos (I took about 80) I have to say that, as much as I love my iPhone 4S, they are not nearly as good as I had hoped. Certainly far from the photos I shot during a claim-staking trip in the Cassiar Mountains near Watson Lake, Yukon, using a Canon FTb 35 mm camera (one of these days I’ll have to get those photos out of the attic and publish some of them). I often think of going back to my reflex camera, although I hear the iPhone 5 camera is a big improvement, with the iPhone 5S being even faster, so there’s hope.

Bonus photo

Here’s a beautiful elk. I took this photo another day, on the highway just outside of Jasper, but I thought it would fit in here.

elk

Geology photos

I love meandering rivers, so I took a whole lot of photos of the creek. The first one shows a nice sandbar right where we started the counting.

start

This next one is a nice shot of the meandering creek looking back.

Camp Creek

In this third one you can see two nicely developed meander loops with point bars.

point_bar

Last, but not least, a really tight meander. I love this photo, it’s my overall favourite.

tight_loop

Human activity photos

I am also including some photos showing the human footprint on the land. This first one is a clearing – I am not certain for what purpose, likely a new development. The circular patches are places where the logs were collected and burned. Quite the footprint, seen from here.

clearing

Next is something I did not expect to see here – although I probably should have: a golf course. They are omnipresent, and often obnoxious, to say the least.

golf_course

This is one I quite like: the creek, the railway, and the Yellowhead highway, all running next to one another.

three_tracks

The team

Finally, a shot of Gerry and me in the back of the chopper, and one of Gerry counting.

duo

counting