Machine Learning quiz – part 3 of 3

In my last two posts I published part 1 and part 2 of this Machine Learning quiz. If you have not read them, please do (and cast your votes) before you read part 3 below.

QUIZ, part 3: vote responses and (some) answers

In part 1 I asked which predictions looked “better”: those from model A or those from model B (Figure 1)?

Figure 1

As a reminder, both model A and model B were trained to predict the same labeled facies picked by a geologist on core, shown on the left columns (they are identical) of the respective model panels. The right columns in each panels are the predictions.

The question is asked out of context, with no information given about the training process, and or difference in data manipulation (if any) and/or model algorithm used. Very unfair, I know!  And yet, ~78% of 54 respondent clearly indicated their preference for model A. My sense is that this is because model A looks overall smoother and has less of the extra misclassified thin layers.

Response 1

In part 2, I presented the two predictions, this time accompanied by a the confusion matrix for each model (Figure 2).

Figure 2

I asked again which model would be considered better [1] and this was the result:

Response 2a

Although there were far fewer votes (not as robust a statistical sample) I see that the proportion of votes is very similar to that in the previous response, and decidedly in favor of model A, again. However, the really interesting learning, and to me surprising, came from the next answer (Response 2b): about 82% of the 11 respondents believe the performance scores in the confusion matrix to be realistic.

Response 2b

Why was it a surprise? It is now time to reveal the trick…..

…which is that the scores in part 2, shown in the confusion matrices of Figure 2, were calculated on the whole well, for training and testing together!!

A few more details:

  • I used default parameters for both models
  • I used a single 70/30 train/test split (the same random split for both models) with no crossvalidation

which is, in essence, how to NOT do Machine Learning!

In Figure 3, I added a new column on the right of each prediction showing in red which part of the result is merely memorized, and in black which part is interpreted (noise?). Notice that for this particular well (the random 70/30 split was done on all wells together) the percentages are 72.5% and 27.5%.

I’ve also added the proper confusion matrix for each model, which used only the test set. These are more realistic (and poor) results.

Figure 3

So, going back to that last response: again, with 11 votes I don’t have solid statistics, but with that caveat in mind one might argue that this is a way you could be ‘sold’ unrealistic (as in over-optimistic) ML results.

At least you could sell them by being vague about the details to those not familiar with the task of machine classification of rock facies and its difficulties (see for example this paper for a great discussion about resolution limitations inherent  in using logs (machine) as opposed to core (human geologist).

Acknowledgments

A big thank you goes to Jesper (Way of the Geophysicist) for his encouragement and feedback, and for brainstorming with me on how to deliver this post series.


[1] notice that, as pointed out in part 2, model predictions were slightly different from those part 1 because I’d forgotten to set the random seed to be the same in the two pipelines; but not very much, the overall ‘look’ was very much the same.

Machine Learning quiz – part 2 of 3

In my previous post I posted part 1 (of 3) of a Machine Learning quiz. If you have not read that post, please do, cast your vote, then come back and try part 2 below.

QUIZ, part 2

Just as a quick reminder, the image below shows the rock facies predicted from two models, which I just called A and B. Both were trained to predict the same labeled rock facies, picked by a geologist on core, which are shown on the left columns (they are identical) of the respective model panels. The right columns in each panels are the predictions.

*** Please notice that the models in this figure are (very slightly) different from part 1 because I’d forgotten to set the random seed to be the same in the two pipelines (yes, it happens, my apology). But they are not so different, so I left the image in part 1 unchanged and just updated this one.

Please answer the first question: which model predicts the labeled facies “better” (visually)?

Now study the performance as summarized in the confusion matrices for each model (the purple arrows indicate to which model each matrix belongs; I’ve highlighted in green the columns where each model does better, based on F1 (you don’t have to agree with my choice), and answer the second question (notice the differences are often a few 1/100s, or just one).

 

Machine Learning quiz – part 1 of 3

Introduction

I’ve been meaning to write about the 2016 SEG Machine Learning Contest for some time. I am thinking of a short and not very structured series (i.e. I’ll jump all over the place) of 2, possibly 3 posts (with the exclusion of this quiz). It will mostly be a revisiting – and extension – of some work that team MandMs (Mark Dahl and I) did, but not necessarily posted. I will touch most certainly on cross-validation, learning curves, data imputation, maybe a few other topics.

Background on the 2016 ML contest

The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data form the original notebook by Brendon Hall:

The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

The nine discrete facies (classes of rocks) are:

For some examples of the work during the contest, you can take a look at the original notebook, one of the submissions by my team, where we used Support Vector Classification to predict the facies, or a submission by the one of the top 4 teams, all of whom achieved the highest scores on the validation data with different combinations of Boosted Trees trained on augmented features alongside the original features.

QUIZ

Just before last Christmas, I run a little fun experiment to resume work with this dataset. I decided to turn the outcome into a quiz.

Below I present the predicted rock facies from two distinct models, which I call A and B. Both were trained to predict the same labeled facies picked by the geologist, which are shown on the left columns (they are identical) of the respective model panels. The right columns in each panels are the predictions. Which predictions are “better”?

Please be warned, the question is a trick one. As you can see, I am gently leading you to make a visual, qualitative assessment of “better-ness”, while being absolutely vague about the models and not giving any information about the training process, which is intentional, and – yes! – not very fair. But that’s the whole point of this quiz, which is really a teaser to the series.

Working with ZMAP+ grid files in Python

I just added to the awesome Geoscience I/O library a notebook showing how to use Python to read in an XYZ grid file in legacy ZMAP+ format as numpy array, display it with matplotlib, write it to a plain ASCII XYZ grid text file. I will use the DEM data for Mount St. Helens BEFORE the 1980 eruption, taken from Evan Bianco’s excellent notebook, which I converted to ZMAP+ (that part I will demonstrate in a future post / notebook).

You can get the full notebook on GitHub.

ZMAP plus files have a header section that looks something like this (this one comes from the DEM file I am using):

! -----------------------------------------
! ZMap ASCII grid file
!
! -----------------------------------------
@unnamed HEADER, GRID, 5
15, 1E+030, , 7, 1
468, 327, 0, 326, 0, 467
0. , 0. , 0.
@

where the rows starting with a ! symbol are comments, and the rows in between the two @ symbols store the information about the grid. In particular, the first two entries in the third line of this column are nrows and ncols, the number of rows and columns,  respectively.

The header section is followed by a data section of ncols blocks each containing nrows elements arranged in rows of 5 (given by the last number of the first row with the @ symbol.

OK, let’s get to it.

I break the task of reading in two parts: the first part is to gather from the header the information I will be using. I want from the section between the two @s:

  • the second entry in its second line – the null value
  • the first two entries in its third line – the number of rows and columns, as mentioned
  • the last four entries in its third line – the xmin, xmax, ymin, ymax, respectively

I am making the assumption that the number of commented rows may not be the same in all ZMAP+ files, hence I abandoned my first approach which was based on using getline with specific line numbers. Instead, I used:

  • an if statement to check for the line beginning with @ (usually followed by a grid name) but not @\n; then grab the next two lines
  • another if statement to check whether a line starts with !
  • a counter is increased by one for each line meeting either of the above conditions. This is used later on to skip all lines that do not have data
filename = '../data/Helens_before_XYZ_ZMAP_PLUS.dat'
count = 0
hdr=[]
with open(filename) as h:
    for line in h:
        if line.startswith('!'): 
            count += 1
        if line.startswith('@') and not line.startswith('@\n'):
            count += 1
            hdr.append(next(h)); count += 1
            hdr.append(next(h)); count += 1
            count += 1

which gives:

hdr
>>> ['15, 1E+030, , 7, 1\n', '468, 327, 0, 326, 0, 467\n']

Next I use the line hdr = [x.strip('\n') for x in ','.join(hdr).split(',')] to remove the newline, and the next if / else statement to deal with the null values, if present:

if hdr[1] ==' ': null=np.nan
else: null = float(hdr[1])

Finally, I get the remainder of the information with:

xmin, xmax, ymin, ymax = [ float(x) for x in hdr[7:11]]
grid_rows, grid_cols = [int(y) for y in hdr[5:7]]

Which I can check with:

print(null, xmin, xmax, ymin, ymax, grid_rows, grid_cols)
>>> 1e+30 0.0 326.0 0.0 467.0 468 327

That’s it! I think there is room to come back and write something more efficient later on, but for now I am satisfied this is robust enough to handle headers.

The rest, reading in the data, is downhill, with a first loop to skip the 9 header lines, and a second one to read all elements in a single array:

with open(filename) as f:
    [next(f) for _ in range(count)]                
    for line in f:
        grid = (np.asarray([word for line in f for word in line.split()]).astype(float)).reshape(grid_cols, grid_rows).T
grid[grid==null]=np.nan

Done!!

Working with 3D seismic data in Python using segyio and numpy (mostly)

Between 2 and 3 years ago I started turning my long time passion for image processing, and particularly morphological image processing, to the task of fault segmentation.

At the time I shared my preliminary code, of which I was very happy, in a Jupyter notebook, which you can run interactively at this GitHub repository.

Two areas need improvement to get that initial workflow closer to a production one. The first one is on the image processing and morphology side; I am thinking of including: a better way to clean-up very short faults; pruning to eliminate spurious segments in the skeletonization result; the von Mises distribution instead of standard distribution to filter low dip angles. I know how to improve all those aspects, and have some code snippets sitting around in various locations on my Mac, but I am not quite ready to push for it.

The second area, on the seismic side, is ability to work with 3D data. This has been a sore spot for some time.

Enter segyio, a fast, open-source library, developed precisely to work with SEGY files. To be fair, segyio has been around for some time, as I very well know from being a member of the Software Underground community (swung), but it was only a month or so ago that I started tinkering with it.

This post is mostly to share back with the community what I’ve learned in my very first playground session (with some very helpful tips from Jørgen Kvalsvik, a fellow member of swung, and one of the creators of segyio), which allowed me to create a 3D fault segmentation volume (and have lots of fun in the process) from a similarity (or discontinuity) volume.

The workflow, which you can run interactively at this segyio-notebooks GitHub repository (look for the 01 – Basic tutorial) is summarized pictorially in Figure 1, and comprises the steps below:

  • use segy-io to import two seismic volumes in SEGY file format from the F3 dataset, offshore Netherlands, licensed CC-BY-SA: a similarity volume, and an amplitude volume (with dip steered median filter smoothing applied)
  • manipulate the similarity to create a discontinuity/fault volume
  • create a fault mask and display a couple of amplitude time slices with superimposed faults
  • write the fault volume to SEGY file using segy-io, re-using the headers from the input file
workflow

Figure 1. Naive 3D seismic fault segmentation workflow in Python.

 

Get the notebooks on GitHub (look for the 01 – Basic tutorial)

Feedback is welcome.

 

DISCLAIMER: The steps outlined in the tutorial are not intended as a production-quality fault segmentation workflow. They work reasonably well on the small, clean similarity volume, artfully selected for the occasion, but it is just a simple example.

Upscaling geophysical logs with Python using Pandas and Bruges

With a few hours of work last weekend, I finished putting together a Jupyter notebook tutorial, started at the Geophysics Python sprint 2018, demonstrating how to:

    • Use Agile Scientific’s Welly to load two wells with several geophysical logs
    • Use PandasWelly, and NumPy to: remove all logs except for compressional wave velocity (Vp), shear wave velocity (Vs), and density (RHOB); store the wells in individual DataFrames; make the sampling rate common to both wells; check for null values; convert units from imperial to metric; convert slowness to velocity; add a well name column
    • Split the DataFrame by well using unique values in the well name column
    • For each group/well use Agile Scientific’s Bruges ‘s Backus average to upscale all curves individually
    • Add the upscaled curves back to the DataFrame

Matt Hall, (organizer), told me during a breakfast chat on the first day of the sprint that this tutorial would be a very good to have since it is one of the most requested examples by neophyte users of the Bruges library; I was happy to oblige.

The code for the most important bit, the last two items in the above list, is included below:

# Define parameters for the Backus filter
lb = 40   # Backus length in meters
dz = 1.0  # Log sampling interval in meters

# Do the upscaling work
wells_bk = pd.DataFrame()
grouped = wells['well'].unique()  
for well in grouped:
    new_df = pd.DataFrame()
    Vp = np.array(wells.loc[wells['well'] == well, 'Vp'])
    Vs = np.array(wells.loc[wells['well'] == well, 'Vs'])
    rhob = np.array(wells.loc[wells['well'] == well, 'RHOB'])
    Vp_bks, Vs_bks, rhob_bks = br.rockphysics.backus(Vp, Vs, rhob, lb, dz)
    new_df['Vp_bk'] = Vp_bks
    new_df['Vs_bk'] = Vs_bks
    new_df['rhob_bk'] = rhob_bks
    wells_bk = pd.concat([wells_bk, new_df])

# Add to the input DataFrame
wells_final = (np.concatenate((wells.values, wells_bk.values), axis=1)) 
cols = list(wells) + list(wells_bk) 
wells_final_df = pd.DataFrame(wells_final, columns=cols)

And here is a plot comparing the raw and upscaled Vp and Vs logs for one of the wells:

backus

Please check the notebook if you want to try the full example.

Geophysics Python sprint 2018 – day 2 and beyond, part II

In the last post I wrote about what Volodymyr and I worked on during a good portion of day two of the sprint in October, and continued to work on upon our return to Calgary.

In addition to that I also continued to work on a notebook example, started in day one, demonstrating on how to upscale sonic and density logs from more than one log at a time using Bruges ‘ backusand Panda’s groupby. This will be the focus of a future post.

The final thing I did was to write, and test an error_flag function for Bruges. The function calculates the difference between a predicted and a real curve; it flags errors in prediction if the difference between the curves exceeds a user-defined distance (in standard deviation units) from the mean difference. Another option available is to check whether the curves have opposite slopes (for example one increasing, the other decreasing within a specific interval). The result is a binary error log that can then be used to generate QC plots, to evaluate the performance of the prediction processes in a more (it is my hope) insightful way.

The inspiration for this stems from a discussion over coffee I had 5 or 6 years ago with Glenn Larson, a Geophysicist at Devon Energy, about the limitations of (and alternatives to) using a single global score when evaluating the result of seismic inversion against wireline well logs (the ground truth). I’d been holding that in the back of my mind for years, then finally got to it last Fall.

flag_full

Summary statistics can also be calculated by stratigraphic unit, as demonstrated in the accompanying Jupyter Notebook.

Geophysics Python sprint 2018 – day 2 and beyond, part I

In my last post I wrote about what I did on day one of the Geophysics sprint run by Agile Scientific in Santa Ana two weeks ago.

This post and the next one are about the project Volodymyr and I worked on during day two of the sprint, and continued to work on upon our return to Calgary.

We had read a great notebook by Alessandro Amato del Monte (I recommend browsing his Geophysical notes repo) showing how to reconstruct a velocity log from density with optimized alpha and beta parameters for the Inverse Gardner function, found via scipy.curve_fit.

Inspired by that, we set out with a dual goal:

  • First, we wanted to adapt Alessandro’s optimization idea so that it would work with Bruges‘ Inverse Gardner
  • Second, we wanted to adapt a function from some old work of mine to flag sections of the output velocity log with poor prediction; this would be useful to learn where alpha and beta may need to be tweaked because of changes in the rock lithology or fluid content

I’ll walk you through some of our work. Below are the two functions:

# Alessandro's simple inverse Gardner
def inv_gardner(rho, alpha, beta):
    return (rho/alpha)**(1/beta)

# Bruges' inverse Gardner
def inverse_gardner(rho, alpha=310, beta=0.25, fps=False):
    """
    Computes Gardner's density prediction from P-wave velocity.
    Args:
        rho (ndarray): Density in kg/m^3.
        alpha (float): The factor, 310 for m/s and 230 for fps.
        beta (float): The exponent, usually 0.25.
        fps (bool): Set to true for FPS and the equation will use the typical
            value for alpha. Overrides value for alpha, so if you want to use
            your own alpha, regardless of units, set this to False.
        Returns:
            ndarray: Vp estimate in m/s.
    """
    alpha = 230 if fps else alpha
    exponent = 1 / beta
    factor = 1 / alpha**exponent
    return factor * rho**exponent

They look similarly structured, and take the same arguments. We can test them by passing a single density value and alpha/beta pair.

inv_gardner(2000, 0.39, 0.23)
>>> 1.349846231542594e+16

inverse_gardner(2000, 0.39, 0.23)
>>> 1.3498462315425942e+16

Good. So the next logical step would be to define some model density and velocity data (shamelessly taken from Alessandro’s notebook, except we now use Bruges’ Gardner with S.I. units) and pass the data, and Bruges’ inverse Gardner toscipy.curve_fit to see if it does just work; could it be that simple?

# Make up random velocity and density with Bruges' direct Gardner
vp_test = numpy.linspace(1500, 5500)
rho_test = gardner(vp_test, 310, 0.25)
noise = numpy.random.uniform(0.1, 0.3, vp_test.shape)*1000
rho_test = rho_test + noise

The next block is only slightly different from Alessandro’s notebook. Instead of using all data, we splits both density and velocity into two pairs of arrays: a rho12 and vp2 to optimize foralpha and beta,  a rho1 for calculating “unknown” velocities vp_calc1 further down; the last one, v1, will be used just to show where the real data might have been had we not had to calculate it.

idx = np.arange(len(vp_test))
np.random.seed(3)
spl1 = np.random.randint(0, len(vp_test), 15)
spl2 = np.setxor1d(idx,spl1)
rho1 = rho_test[spl1]
rho2 = rho_test[spl2]
vp1= vp_test[spl1] # this we pretend we do not have 
vp2= vp_test[spl2]

Now, as in Alessandro’s notebook, we pass simple inverse Gardner function to scipy.curve_fit to find optimal alpha and beta parameters, and we printalpha and beta.

popt_synt2, pcov2 = scipy.curve_fit(inv_gardner,rho2, vp2)
print (popt_synt2)
>>> [3.31376056e+02 2.51257203e-01]

Those values seem reasonable, but just to be sure let’s calculate vp_calc1 from rho1 and plot everything to be sure.

vp_calc1 = inv_gardner(rho1, *popt_synt2)

# this is to show the fit line
rho_synt_fit=np.linspace(1, 3000, 50)
vp_synt_fit=inv_gardner(rho_synt_fit, *popt_synt2)

plt.figure(figsize=(10, 10))
plt.plot(rho2,vp2,'or', markersize = 10, label = "fitted points")
plt.plot(rho1,vp1,'ob', markersize = 10, alpha = 0.4, label = "calculated points")
plt.plot(rho1,vp1,'ok', markersize = 10, label = "withheld points")
plt.plot(rho_synt_fit, vp_synt_fit, '-r', lw=2, 
         label='Fit' r'$ V_p=(%.2f / \rho)^{1/%.2f}$' %(popt_synt2[0], 
                                                     popt_synt2[1]))
plt.xlabel('Density rho [kg/m^3]'), plt.xlim(1800, 3000)
plt.ylabel('Velocty Vp [m/s]'), plt.ylim(1000, 6000)
plt.grid()
plt.legend(loc='upper left')
plt.show()

plot

That looks great. Let’s now try the same using Bruges’ Inverse Gardner.

popt_synt2, pcov2 = curve_fit(inverse_gardner, rho2, vp2)
print (popt_synt2)
>>> [1.         0.29525991 1.        ]

That is odd, we do not get the same parameters; additionally, there’s this error message:

../scipy/optimize/minpack.py:794:
OptimizeWarning: Covariance of the parameters could not be estimated 
category=OptimizeWarning)

One possible explanation is that although both inv_gardner and inverse_gardner take three parameters, perhaps scipy.curve_fit does not know to expect it because in the latter alpha and betaare pre-assigned.

The workaround for this was to write a wrapper function to ‘map’ between the call signature of scipy.curve_fit and that of inverse_gardner so that it would be ‘communicated’ to the former explicitly.

def optimize_inverse_gardner(rho, alpha, beta):
    return inverse_gardner(rho, alpha=alpha, beta=beta)

popt_synt2, pcov2 = scipy.curve_fit(optimize_inverse_gardner, 
                                    rho2, vp2) 
print (popt_synt2)
>>> [3.31376060e+02 2.51257202e-01]

Which is the result we wanted.

In the next post we will apply this to some real data and show how to flag areas of poorer results.