I’ve been meaning to write about the 2016 SEG Machine Learning Contest for some time. I am thinking of a short and not very structured series (i.e. I’ll jump all over the place) of 2, possibly 3 posts (with the exclusion of this quiz). It will mostly be a revisiting – and extension – of some work that team MandMs (Mark Dahl and I) did, but not necessarily posted. I will most certainly touch on cross-validation, learning curves, data imputation, and maybe a few other topics.
Background on the 2016 ML contest
The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data from the original notebook by Brendon Hall:
The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)
The nine discrete facies (classes of rocks) are:
For some examples of the work during the contest, you can take a look at the original notebook, one of the submissions by my team, where we used Support Vector Classification to predict the facies, or a submission by one of the top 4 teams, all of whom achieved the highest scores on the validation data with different combinations of Boosted Trees trained on augmented features alongside the original features.
QUIZ
Just before last Christmas, I ran a fun little experiment to resume work with this dataset. I decided to turn the outcome into a quiz.
Below I present the predicted rock facies from two distinct models, which I call A and B. Both were trained to predict the same labeled facies picked by the geologist, which are shown in the left column of each model panel (the two are identical). The right column in each panel shows the predictions. Which predictions are “better”?
Please be warned, the question is a trick one. As you can see, I am gently leading you to make a visual, qualitative assessment of “better-ness”, while being absolutely vague about the models and not giving any information about the training process, which is intentional, and – yes! – not very fair. But that’s the whole point of this quiz, which is really a teaser to the series.
The dependent/target variable is oil production (measured in tens of barrels of oil per day) from a marine barrier sand. The independent variables are:
Gross pay, in meters
Phi-h, porosity multiplied by thickness, with a 3% porosity cut-off
Position within the reservoir (a ranked variable, with 1.0 representing the uppermost geological facies, 2.0 the middle one, 3.0 the lowest one)
Pressure draw-down, in MPa
Three additional ‘special’ variables are:
Random 1 and Random 2, which are range bound and random, and were included in the paper
Gross pay transform, which I created specifically for this exercise to be highly correlated to Gross pay, by passing Gross pay to a logarithmic function and then adding a bit of normally distributed random noise
Correlation matrix with ellipses
I am very pleased with having been able to put together, by the end of it, a good-looking scatter matrix that incorporated:
bivariate scatter-plots in the upper triangle, annotated with rank correlation coefficient, confidence interval, and probability of spurious correlation
contours in the lower triangle
shape of the bivariate distributions (KDE) on the diagonal
In a comment to the post, Matt Hall got me thinking about other ways to visualize the correlation coefficient. I did not end up using a colourmap for the facecolour of the plot (although this would probably be relatively easy; in an earlier attempt using hex-bin plots, scaling each plot’s colourmap independently – to account for outliers – proved challenging). But after some digging I found the Biokit library, which comes with a lot of useful visualizations, among which corrplot is exactly what I was looking for. With only a bit of tinkering I was able to produce, shown in Figure 1, a correlation matrix with:
correlation coefficient in upper triangle (colour and intensity indicate whether positive or negative correlation, and its strength, respectively)
bivariate ellipses in the lower triangle (ellipse direction and colour indicates whether positive or negative correlation; ellipticity and colour intensity are proportional to the correlation coefficient)
Figure 1. Correlation matrix using the Biokit library
Also notice that – quite conveniently – the correlation matrix of Figure 1 is reordered with strongly correlated variables adjacent to one another, which facilitates interpretation. This is done using the rank correlation coefficient, with pandas.DataFrame.corr, and Biokit’s corrplot:
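Below is a minimal sketch of that step. I am assuming Biokit’s Corrplot accepts a precomputed correlation matrix and that its plot method takes method names for the upper and lower triangles; data is the DataFrame of variables described above.

import pandas as pd
from biokit.viz import corrplot

# Rank (Spearman) correlation matrix with pandas
corr = data.corr(method='spearman')

# Coefficients in the upper triangle, ellipses in the lower triangle;
# Corrplot also reorders the variables by similarity
c = corrplot.Corrplot(corr)
c.plot(upper='text', lower='ellipse')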
The insightful take-away is that with this reordering, the more ‘interesting’ variables, i.e. those with strong correlation (as measured in this case by the rank correlation coefficient), are close together and line up along the diagonal, so we can immediately appreciate that Production, Phi-h, and Gross Pay, plus to a lesser extent Position (albeit this one with negative correlation to Production), are related to one another. This is a great intuition, and it supports our hypothesis (to be checked with an inferential test), backed by physical expectation, that Production should be related to those other quantities.
But I think it is time to move away from either Pearson or Spearman correlation coefficient.
Correlation matrix with distance correlation and its p-value
I learned about distance correlation from Thomas when we were starting to work on our 2018 CSEG/CASP Geoconvention talk “Data science tools for petroleum exploration and production”. What I immediately liked about distance correlation is that it does not assume a linear relationship between variables, and, even more importantly, whereas with Pearson and Spearman a correlation value of zero does not prove independence between any two variables, a distance correlation of zero does mean that there is no dependence between those two variables.
For Python, I used the dcor.distance_correlation and dcor.independence.distance_covariance_test functions from the dcor library (with many thanks to Carlos Ramos Carreño, author of the Python library, who was kind enough to point me to the table of energy-dcor equivalents). So, for example, for one variable pair, we can do this:
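Here is a minimal sketch for a single pair; the column names are my assumption for illustration, and data is the DataFrame of variables described above.

import dcor

x = data['Production'].values
y = data['Gross pay'].values

dc = dcor.distance_correlation(x, y)
# as in the dist_corr function below, index [0] grabs the p-value of the test
pv = dcor.independence.distance_covariance_test(x, y, exponent=1.0, num_resamples=2000)[0]
print(dc, pv)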
So, wanting to apply these tests in a pairwise fashion to all variables, I modified the dist_corr and corrfunc functions from the existing notebook:
import dcor
import matplotlib.pyplot as plt

def dist_corr(X, Y, pval=True, nruns=2000):
    """Distance correlation with p-value from bootstrapping."""
    dc = dcor.distance_correlation(X, Y)
    pv = dcor.independence.distance_covariance_test(X, Y, exponent=1.0, num_resamples=nruns)[0]
    if pval:
        return (dc, pv)
    else:
        return dc

def corrfunc(x, y, **kws):
    """Annotate a scatterplot with the distance correlation, coloured by p-value."""
    d, p = dist_corr(x, y)
    # print("{:.4f}".format(d), "{:.4f}".format(p))
    if p > 0.1:
        pclr = 'Darkgray'
    else:
        pclr = 'Darkblue'
    ax = plt.gca()
    ax.annotate("DC = {:.2f}".format(d), xy=(.1, 0.99), xycoords=ax.transAxes, color=pclr, fontsize=14)
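For context, here is a rough sketch of how corrfunc might be mapped onto the panels of a Seaborn PairGrid to produce something like Figure 2; the exact styling of the original figure is not reproduced.

import seaborn as sns

g = sns.PairGrid(data)                     # data: DataFrame of the variables
g.map_upper(plt.scatter, s=10)             # scatterplots in the upper triangle
g.map_lower(sns.kdeplot, cmap='Blues_d')   # contours in the lower triangle
g.map_diag(sns.kdeplot)                    # distributions on the diagonal
g.map_upper(corrfunc)                      # distance correlation annotation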
Figure 2. Revised Seaborn pairgrid matrix with distance correlation colored by p-value (gray if p > 0.10, blue if p <= 0.10)
Clustering using distance correlation
I really like the result in Figure 2. However, I want to have more control on how the pairwise plots are arranged; a bit like in Figure 1, but using my metric of choice, which would be again the distance correlation. To do that, I will first show how to create a square matrix of distance correlation values, then I will look at clustering of the variables; but rather than passing the raw data to the algorithm, I will pass the distance correlation matrix. Get ready for a ride!
For the first part, making the square matrix of distance correlation values, I adapted the code from this brilliant SO answer on Euclidean distance (I recommend you read the whole answer):
# Create the distance method using distance_correlation
distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
# Apply the distance method pairwise to every column
rslt = data.apply(lambda col1: data.apply(lambda col2: distcorr(col1, col2)))
# check output
pd.options.display.float_format = '{:,.2f}'.format
rslt
Table I. Distance correlation matrix.
The matrix in Table I looks like what I wanted, but let’s calculate a couple of values directly, to be sure:
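As a sanity check, we can compute a couple of the entries in Table I directly (again, the column names are my assumption for illustration):

print(dcor.distance_correlation(data['Production'], data['Gross pay']))
print(dcor.distance_correlation(data['Production'], data['Phi-h']))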
Now I am going to take a bit of a detour, and use that matrix, rather than the raw data, to cluster the variables, and then display the result with a heat-map and accompanying dendrograms. That can be done with Biokit’s heatmap:
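A minimal sketch of that step, assuming Biokit’s Heatmap class accepts the distance correlation matrix directly:

from biokit.viz import heatmap

h = heatmap.Heatmap(rslt)   # rslt is the distance correlation matrix from Table I
h.plot()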
Figure 3. Biokit heatmap with dendrograms, using correlation distance matrix
That’s very nice, but please notice how much ‘massaging’ it took: first, I had to flip the axis of the dendrogram for the rows (on the left) because it would be incorrectly reversed by default; and then, I had to shorten the name of Gross-pay transform so that its row label would not end up under the colorbar (also, the diagonal is flipped upside down, and I could not reverse it or else the column labels would go under the top dendrogram). I suppose the latter too could be done on the Matplotlib side, but why bother when we can get it all done much more easily with Seaborn? Plus, you will see below that I actually have to… but I’m putting the cart before the horse…. here’s the Seaborn code:
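Here is a minimal version of the Seaborn call; clustermap does the clustering and draws the dendrograms in a single shot (the colormap and annotation options are my choice, not necessarily those used for Figure 4):

import seaborn as sns

cl = sns.clustermap(rslt, annot=True, fmt='.2f', cmap='viridis')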
and the result in Figure 4. We really get everything for free!
Figure 4. Seaborn clustermap, using correlation distance matrix
Before moving to the final fireworks, a bit of interpretation: again what comes up is that Production, Phi-h, Gross Pay, and Gross pay transform group together, as in Figure 1, but now the observation is based on a much more robust metric. Position is not as ‘close’: it ends up in a different cluster, although its distance correlation from Production is 0.45 and the p-value is <0.10, hence it is still a relevant variable.
I think this is as far as I would go with interpretation. It also shows me that Gross pay and Gross pay transform are very much related to one another, with high dependence, but it still does not tell me, in the context of variable selection for predicting Production, which one I should drop: I only know it should be Gross pay transform because I created it myself. For proper variable selection I will look at techniques like the Least Absolute Shrinkage and Selection Operator (LASSO, which Thomas has showcased in his R notebook) and Recursive Feature Elimination (I’ll be testing Sebastian Raschka‘s Sequential Feature Selector from the mlxtend library).
Correlation matrix with distance correlation, p-value, and plots rearranged by clustering
I started this whole exercise by saying I wanted to control how the pairwise plots were arranged in the scatter matrix, and that to do so required the use of Seaborn. Indeed, it turns out the reordered row/column indices (in our case they are the same) can be easily accessed:
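A sketch of how that might look, using the ClusterGrid returned by the clustermap above (the attribute names are Seaborn’s; the rest is my assumption):

new_order = cl.dendrogram_row.reordered_ind    # row indices after clustering
cols = [rslt.columns[i] for i in new_order]    # variable names in the new order

g = sns.PairGrid(data[cols])                   # rearranged pairwise matrix
g.map_upper(plt.scatter, s=10)
g.map_lower(sns.kdeplot, cmap='Blues_d')
g.map_diag(sns.kdeplot)
g.map_upper(corrfunc)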
Figure 5. Revised Seaborn pairgrid matrix with distance correlation colored by p-value (gray if p > 0.10, blue if p <= 0.10), and plots rearranged by clustering results
I just added to the awesome Geoscience I/O library a notebook showing how to use Python to read in an XYZ grid file in legacy ZMAP+ format as a numpy array, display it with matplotlib, and write it to a plain ASCII XYZ grid text file. I will use the DEM data for Mount St. Helens BEFORE the 1980 eruption, taken from Evan Bianco’s excellent notebook, which I converted to ZMAP+ (that part I will demonstrate in a future post / notebook).
The rows starting with a ! symbol are comments, and the rows in between the two @ symbols store the information about the grid. In particular, the first two entries in the third line of this section are nrows and ncols, the number of rows and columns, respectively.
The header section is followed by a data section of ncols blocks, each containing nrows elements arranged in rows of 5 (the number of values per row is given by the last number in the first row with the @ symbol).
OK, let’s get to it.
I break the task of reading the file into two parts: the first part is to gather from the header the information I will be using. From the section between the two @s I want:
the second entry in its second line – the null value
the first two entries in its third line – the number of rows and columns, as mentioned
the last four entries in its third line – the xmin, xmax, ymin, ymax, respectively
I am making the assumption that the number of commented rows may not be the same in all ZMAP+ files, hence I abandoned my first approach which was based on using getline with specific line numbers. Instead, I used:
an if statement to check for the line beginning with @ (usually followed by a grid name) but not @\n; then grab the next two lines
another if statement to check whether a line starts with !
a counter is increased by one for each line meeting either of the above conditions. This is used later on to skip all lines that do not have data
import numpy as np

filename = '../data/Helens_before_XYZ_ZMAP_PLUS.dat'

count = 0
hdr = []
with open(filename) as h:
    for line in h:
        if line.startswith('!'):
            count += 1                        # comment line
        if line.startswith('@') and not line.startswith('@\n'):
            count += 1                        # first header line (@ grid name)
            hdr.append(next(h)); count += 1   # second header line
            hdr.append(next(h)); count += 1   # third header line
            count += 1                        # closing @ line
Next I use the line hdr = [x.strip('\n') for x in ','.join(hdr).split(',')] to remove the newlines and split the header into individual entries, and the next if / else statement to deal with the null value, if present:
if hdr[1] == ' ':
    null = np.nan
else:
    null = float(hdr[1])
Finally, I get the remainder of the information with:
xmin, xmax, ymin, ymax = [ float(x) for x in hdr[7:11]]
grid_rows, grid_cols = [int(y) for y in hdr[5:7]]
That’s it! I think there is room to come back and write something more efficient later on, but for now I am satisfied this is robust enough to handle headers with a variable number of comment rows.
The rest, reading in the data, is downhill, with a first loop to skip the header lines (9 in this case, using the count from above), and a second one to read all elements into a single array:
with open(filename) as f:
    [next(f) for _ in range(count)]            # skip the header lines
    grid = (np.asarray([word for line in f for word in line.split()])
            .astype(float)).reshape(grid_cols, grid_rows).T

grid[grid == null] = np.nan                    # replace null values with NaN
In this post I want to highlight one aspect in particular: doing data exploration visually, but also quantitatively with inferential statistical tests.
Like all data scientists (professional, or in the making, and I ‘park’ myself for now in the latter bin), after data cleanup a scatter matrix is often the first thing I produce to look for obvious pairwise relationships and trends between variables. And I really love the flexibility, and looks, of seaborn‘s PairGrid.
The scatter matrix in Figure 1 includes bivariate scatter-plots in the upper triangle, contours in the lower triangle, shape of the bivariate distributions on the diagonal.
It looks terrific. But, as I said, I’d like to show how to add some useful annotation to the individual scatterplots. These are the ones I typically focus on:
the correlation coefficient (CC)
the confidence interval (CI) of the correlation coefficient
the probability of spurious correlation (P)
I also need a function to calculate the critical r, that is, the value of correlation coefficient above which one can rule out chance as an explanation for a relationship:
where a is equal to alpha/2, alpha being the level of significance, or the chance of being wrong that one accepts to live with, and nwells-2 is the number of degrees of freedom.
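Below is a minimal sketch of such a function; the name and signature are mine, chosen for illustration, and it follows the standard relationship between the critical t value and the critical correlation coefficient.

import numpy as np
from scipy import stats

def critical_r(nwells, alpha=0.05):
    """Critical correlation coefficient at significance level alpha."""
    df = nwells - 2                        # degrees of freedom
    t = stats.t.ppf(1 - alpha / 2, df)     # critical t at a = alpha/2
    return t / np.sqrt(t**2 + df)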
Finally, I need a utility function (adapted from this Stack Overflow answer) to calculate on the fly and annotate the individual scatterplots with the CC, CI, and P.
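As a rough idea of what that utility function does (a sketch with assumed names, not the adapted Stack Overflow code): it computes the Spearman CC and P with scipy, derives an approximate 95% CI through the Fisher z-transform, and writes the results on the current axes.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def annotate_cc(x, y, **kws):
    cc, p = stats.spearmanr(x, y)
    n = len(x)
    z, se = np.arctanh(cc), 1 / np.sqrt(n - 3)          # Fisher z-transform
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
    ax = plt.gca()
    ax.annotate("CC = {:.2f} [{:.2f}, {:.2f}], P = {:.3f}".format(cc, lo, hi, p),
                xy=(0.1, 0.9), xycoords=ax.transAxes)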
Between 2 and 3 years ago I started turning my long time passion for image processing, and particularly morphological image processing, to the task of fault segmentation.
At the time I shared my preliminary code, of which I was very happy, in a Jupyter notebook, which you can run interactively at this GitHub repository.
Two areas need improvement to get that initial workflow closer to a production one. The first one is on the image processing and morphology side; I am thinking of including: a better way to clean up very short faults; pruning to eliminate spurious segments in the skeletonization result; the von Mises distribution instead of the standard distribution to filter low dip angles. I know how to improve all those aspects, and have some code snippets sitting around in various locations on my Mac, but I am not quite ready to push for it.
The second area, on the seismic side, is ability to work with 3D data. This has been a sore spot for some time.
Enter segyio, a fast, open-source library, developed precisely to work with SEGY files. To be fair, segyio has been around for some time, as I very well know from being a member of the Software Underground community (swung), but it was only a month or so ago that I started tinkering with it.
This post is mostly to share back with the community what I’ve learned in my very first playground session (with some very helpful tips from Jørgen Kvalsvik, a fellow member of swung, and one of the creators of segyio), which allowed me to create a 3D fault segmentation volume (and have lots of fun in the process) from a similarity (or discontinuity) volume.
The workflow, which you can run interactively at this segyio-notebooks GitHub repository (look for the 01 – Basic tutorial) is summarized pictorially in Figure 1, and comprises the steps below:
use segyio to import two seismic volumes in SEGY file format from the F3 dataset, offshore Netherlands, licensed CC-BY-SA: a similarity volume, and an amplitude volume (with dip-steered median filter smoothing applied)
manipulate the similarity to create a discontinuity/fault volume
create a fault mask and display a couple of amplitude time slices with superimposed faults
write the fault volume to a SEGY file using segyio, re-using the headers from the input file
Figure 1. Naive 3D seismic fault segmentation workflow in Python.
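As a taste of the first step, here is a minimal sketch of loading one of the volumes with segyio into a numpy cube (the file name is a placeholder; the iline/xline byte locations are segyio’s defaults and may need adjusting for other datasets):

import segyio

with segyio.open('similarity.sgy', iline=189, xline=193) as f:
    similarity = segyio.tools.cube(f)    # 3D numpy array: (ilines, xlines, samples)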
DISCLAIMER: The steps outlined in the tutorial are not intended as a production-quality fault segmentation workflow. They work reasonably well on the small, clean similarity volume, artfully selected for the occasion, but it is just a simple example.
Use Agile Scientific’s Welly to load two wells with several geophysical logs
Use Pandas, Welly, and NumPy to: remove all logs except for compressional wave velocity (Vp), shear wave velocity (Vs), and density (RHOB); store the wells in individual DataFrames; make the sampling rate common to both wells; check for null values; convert units from imperial to metric; convert slowness to velocity; add a well name column
Split the DataFrame by well using unique values in the well name column
For each group/well use the Backus average from Agile Scientific’s Bruges library to upscale all curves individually
Add the upscaled curves back to the DataFrame
Matt Hall (the organizer) told me during a breakfast chat on the first day of the sprint that this tutorial would be a very good one to have, since it is one of the most requested examples by neophyte users of the Bruges library; I was happy to oblige.
The code for the most important bit, the last two items in the above list, is included below:
# Define parameters for the Backus filter
lb = 40 # Backus length in meters
dz = 1.0 # Log sampling interval in meters
# Do the upscaling work
wells_bk = pd.DataFrame()
grouped = wells['well'].unique()

for well in grouped:
    new_df = pd.DataFrame()
    Vp = np.array(wells.loc[wells['well'] == well, 'Vp'])
    Vs = np.array(wells.loc[wells['well'] == well, 'Vs'])
    rhob = np.array(wells.loc[wells['well'] == well, 'RHOB'])
    Vp_bks, Vs_bks, rhob_bks = br.rockphysics.backus(Vp, Vs, rhob, lb, dz)
    new_df['Vp_bk'] = Vp_bks
    new_df['Vs_bk'] = Vs_bks
    new_df['rhob_bk'] = rhob_bks
    wells_bk = pd.concat([wells_bk, new_df])
# Add to the input DataFrame
wells_final = np.concatenate((wells.values, wells_bk.values), axis=1)
cols = list(wells) + list(wells_bk)
wells_final_df = pd.DataFrame(wells_final, columns=cols)
And here is a plot comparing the raw and upscaled Vp and Vs logs for one of the wells:
Please check the notebook if you want to try the full example.
In the last post I wrote about what Volodymyr and I worked on during a good portion of day two of the sprint in October, and continued to work on upon our return to Calgary.
In addition to that, I also continued to work on a notebook example, started on day one, demonstrating how to upscale sonic and density logs from more than one well at a time using Bruges’ backus and Pandas’ groupby. This will be the focus of a future post.
The final thing I did was to write and test an error_flag function for Bruges. The function calculates the difference between a predicted and a real curve; it flags errors in prediction if the difference between the curves exceeds a user-defined distance (in standard deviation units) from the mean difference. Another option available is to check whether the curves have opposite slopes (for example, one increasing while the other decreases within a specific interval). The result is a binary error log that can then be used to generate QC plots and to evaluate the performance of the prediction process in a more insightful way (or so I hope).
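Here is a sketch of the idea, with a signature I made up for illustration (not necessarily the function as submitted to Bruges):

import numpy as np

def error_flag(pred, actual, dev=1.0, method=1):
    """Flag samples where the prediction is poor.

    method 1: flag where the difference deviates from the mean difference
              by more than dev standard deviations
    method 2: additionally flag where the two curves have opposite slopes
    """
    flag = np.zeros(len(pred))
    diff = actual - pred
    flag[np.abs(diff - diff.mean()) > dev * diff.std()] = 1
    if method == 2:
        flag[np.sign(np.gradient(pred)) != np.sign(np.gradient(actual))] = 1
    return flag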
The inspiration for this stems from a discussion over coffee I had 5 or 6 years ago with Glenn Larson, a Geophysicist at Devon Energy, about the limitations of (and alternatives to) using a single global score when evaluating the result of seismic inversion against wireline well logs (the ground truth). I’d been holding that in the back of my mind for years, then finally got to it last Fall.
Summary statistics can also be calculated by stratigraphic unit, as demonstrated in the accompanying Jupyter Notebook.
This post and the next one are about the project Volodymyr and I worked on during day two of the sprint, and continued to work on upon our return to Calgary.
First, we wanted to adapt Alessandro’s optimization idea so that it would work with Bruges’ inverse Gardner
Second, we wanted to adapt a function from some old work of mine to flag sections of the output velocity log with poor prediction; this would be useful to learn where alpha and beta may need to be tweaked because of changes in the rock lithology or fluid content
I’ll walk you through some of our work. Below are the two functions:
# Alessandro's simple inverse Gardner
def inv_gardner(rho, alpha, beta):
    return (rho/alpha)**(1/beta)

# Bruges' inverse Gardner
def inverse_gardner(rho, alpha=310, beta=0.25, fps=False):
    """
    Computes Gardner's density prediction from P-wave velocity.

    Args:
        rho (ndarray): Density in kg/m^3.
        alpha (float): The factor, 310 for m/s and 230 for fps.
        beta (float): The exponent, usually 0.25.
        fps (bool): Set to true for FPS and the equation will use the typical
            value for alpha. Overrides value for alpha, so if you want to use
            your own alpha, regardless of units, set this to False.

    Returns:
        ndarray: Vp estimate in m/s.
    """
    alpha = 230 if fps else alpha
    exponent = 1 / beta
    factor = 1 / alpha**exponent
    return factor * rho**exponent
They look similarly structured, and take the same arguments. We can test them by passing a single density value and alpha/beta pair.
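For example, a quick sanity check (the density value is arbitrary); both calls should return the same velocity:

print(inv_gardner(2300.0, 310, 0.25))        # simple version
print(inverse_gardner(2300.0, 310, 0.25))    # Bruges version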
Good. So the next logical step would be to define some model density and velocity data (shamelessly taken from Alessandro’s notebook, except we now use Bruges’ Gardner with S.I. units) and pass the data, and Bruges’ inverse Gardner, to scipy.optimize.curve_fit to see if it just works; could it be that simple?
# Make up random velocity and density with Bruges' direct Gardner
vp_test = numpy.linspace(1500, 5500)
rho_test = gardner(vp_test, 310, 0.25)
noise = numpy.random.uniform(0.1, 0.3, vp_test.shape)*1000
rho_test = rho_test + noise
The next block is only slightly different from Alessandro’s notebook. Instead of using all the data, we split both density and velocity into pairs of arrays: rho2 and vp2 to optimize for alpha and beta; rho1 for calculating the “unknown” velocities vp_calc1 further down; the last one, vp1, will be used just to show where the real data might have been had we not had to calculate it.
idx = np.arange(len(vp_test))
np.random.seed(3)

spl1 = np.random.randint(0, len(vp_test), 15)
spl2 = np.setxor1d(idx, spl1)

rho1 = rho_test[spl1]
rho2 = rho_test[spl2]
vp1 = vp_test[spl1]   # this we pretend we do not have
vp2 = vp_test[spl2]
Now, as in Alessandro’s notebook, we pass the inverse Gardner function to scipy.optimize.curve_fit to find optimal alpha and beta parameters, and we print alpha and beta; this time, though, we use Bruges’ inverse_gardner.
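A minimal sketch of that call (variable names as defined above):

from scipy.optimize import curve_fit

popt, pcov = curve_fit(inverse_gardner, rho2, vp2)
print(popt)    # fitted parameters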
That is odd, we do not get the same parameters; additionally, there’s this error message:
../scipy/optimize/minpack.py:794:
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
One possible explanation is that although both inv_gardner and inverse_gardner take three parameters, perhaps scipy.optimize.curve_fit does not know to expect them in the latter because alpha and beta are pre-assigned default values.
The workaround for this was to write a wrapper function to ‘map’ between the call signature of scipy.curve_fit and that of inverse_gardner so that it would be ‘communicated’ to the former explicitly.
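Something along these lines (a sketch of the idea, not necessarily the exact wrapper we wrote):

# Expose alpha and beta as explicit positional arguments so that
# curve_fit sees exactly two parameters to optimize
def gardner_wrapper(rho, alpha, beta):
    return inverse_gardner(rho, alpha=alpha, beta=beta)

popt, pcov = curve_fit(gardner_wrapper, rho2, vp2)
print(popt)    # now only alpha and beta are fitted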