I will be honest. I have been very busy and unable to write much. On the latest advances in large language models and their applications, specifically, I had forced myself to stay on the fence: read about them, stay on top of the developments, but leave the experimenting to others for the time being. Then two things changed my mind: first, I started working through David Mertz’ book Regular Expression Puzzles and AI Coding Assistants (very superficially, admittedly, as regular expressions are not my forte, but the book is awesome); second, I chanced on the article Why everyone should try GPT-4, even the CEO by Cassie Kozyrkov.
Here are the things I am actively playing with, right now. Soon I will start documenting these explorations with a new series of short blog posts:
Write an essay in philosophy based on a specific, complex prompt
Act as baking assistant
Write an imaginary dialogue between specific historic and fictional characters
Demonstrate a geometry theorem
Perform arithmetic operations
Act as song translation assistant
Write a poem with a specific meter, and music in a specific composer’s style
Act as coding assistant
Solve puzzles
Play a live tic tac toe and/or chess game
Detect logical fallacies
Help with complex numerical estimations given a set of rules
2016 Machine learning contest – Society of Exploration Geophysicists
In a previous post I showed how to use pandas.isnull to find out, for each well individually, whether a column has any null values, and sum to get how many there are in each column. Here is one of the examples (with more modern, Pandas-idiomatic syntax compared to the example in the previous post):
for well, data in training_data.groupby('Well Name'):
    print(well)
    print(data.isnull().values.any())
    print(data.isnull().sum(), '\n')
Simple and quick, the output showed me that, for example, the well ALEXANDER D is missing 466 samples from the PE log:
ALEXANDER D
True
Facies 0
Formation 0
Well Name 0
Depth 0
GR 0
ILD_log10 0
DeltaPHI 0
PHIND 0
PE 466
NM_M 0
RELPOS 0
dtype: int64
A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function from the missingno library. With the code below I can turn each well into a Pandas DataFrame on the fly, then make a missingno matrix plot.
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno

for well, data in training_data.groupby('Well Name'):
    msno.matrix(data, color=(0., 0., 0.45))
    fig = plt.gcf()
    fig.set_size_inches(20, np.round(len(data)/100))  # height of each plot reflects well length
    axes = fig.get_axes()
    axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')
In each of the following plots, the sparkline at the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset. To me this is a much more compelling and informative way to inspect log data, as it shows over which range the data is missing. The well length is also annotated at the bottom left, which is how I learned that Recruit F9 is much shorter than the other wells. To reinforce that, I added a line of code so that the height of each plot reflects the well length. I really like it!
2020 Machine Predicted Lithology – FORCE
Since I am taking part in this year’s FORCE Machine Predicted Lithology challenge, I decided to take the above visualization up a notch. Using ipywidgets’ interactive and similar logic (in this case data['WELL'].unique()), the tool below allows browsing wells with a Select widget and checking the chosen well’s curve completeness on the fly. You can try the tool in this Jupyter notebook.
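For anyone curious, here is a minimal sketch of that kind of tool (not the exact code from the notebook); it assumes the FORCE training set is loaded in a DataFrame called data, with a 'WELL' column:

import ipywidgets as widgets
import missingno as msno

# Minimal sketch (not the exact notebook code); `data` is an assumed name for the
# FORCE training set, which has a 'WELL' column
def show_completeness(well):
    msno.matrix(data[data['WELL'] == well], color=(0., 0., 0.45))

widgets.interactive(show_completeness,
                    well=widgets.Select(options=sorted(data['WELL'].unique()),
                                        description='Well'))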
In a second Jupyter notebook, on the other hand, I used the missingno matrix to make a quick visual summary plot of the completeness of the entire dataset, log by log (all wells together):
Then, to explore the data completeness in more depth, below I also plotted the library’s dendrogram plot. As explained in the library’s documentation, “The dendrogram uses a hierarchical clustering algorithm (courtesy of Scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.”
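For reference, both summary plots take a single call each; a minimal sketch, assuming the FORCE data is in a DataFrame called data (all wells together):

import missingno as msno

# `data` is an assumed name for the full FORCE DataFrame
msno.matrix(data)       # log-by-log completeness, with the sparkline at the right
msno.dendrogram(data)   # logs clustered by nullity correlation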
I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they couldn’t be used to guide the strategy to deal with missing data, together with domain knowledge from petrophysics.
Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thought is that it may suggest trying to predict missing values sequentially rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from the top right, I am thinking of testing the use of GR to first predict missing values in RDEP, then both to predict missing values in RMED, then DTC. Then add CALI and use all of the logs completed so far to predict RHOB, and so on.
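I have not implemented this yet; purely as a rough sketch of what the sequential idea could look like (the Random Forest regressor and the function below are hypothetical placeholders, and the log order simply follows the dendrogram reading above):

from sklearn.ensemble import RandomForestRegressor

def complete_sequentially(df, order=('RDEP', 'RMED', 'DTC', 'RHOB')):
    """Hypothetical sketch: predict each missing log in turn, reusing completed logs as predictors."""
    df = df.copy()
    predictors = ['GR']                          # assumed complete (or already filled)
    for target in order:
        have_predictors = df[predictors].notna().all(axis=1)
        known = have_predictors & df[target].notna()
        missing = have_predictors & df[target].isna()
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(df.loc[known, predictors], df.loc[known, target])
        df.loc[missing, target] = model.predict(df.loc[missing, predictors])
        predictors.append(target)                # the completed log joins the predictors
    return df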
Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and cross-validation error first using only the rows without NaNs, then replacing all NaNs with the mean, and finally comparing this sequential log-completion strategy with an all-in-one strategy.
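A minimal sketch of that comparison harness, where X and y are assumed names for the features and lithology labels produced by a given imputation strategy:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X, y are hypothetical names for the features and labels from one imputation strategy;
# repeat for each strategy and compare the resulting curves
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring='accuracy')

print(train_scores.mean(axis=1))   # training accuracy vs. training-set size
print(cv_scores.mean(axis=1))      # cross-validation accuracy vs. training-set size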
Although I have more limited time now, compared to 2016, I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.
The first step after loading the dataset is to create a Pandas DataFrame. With the describe method I get a lot of information for free:
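If you want to follow along, loading the data might look like this (the file name is an assumption; point it at your local copy of the contest training data):

import pandas as pd

# File name is an assumption; use your local copy of the 2016 SEG contest training data
training_data = pd.read_csv('facies_vectors.csv')
training_data.describe()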
Indeed, from the first row of the summary I learn that about 20% of the samples in the photoelectric effect column PE are missing.
I can use pandas.isnull to tell me, for each well, whether a column has any null values, and sum to get the number of missing values, again for each column.
for well in training_data['Well Name'].unique():
    print(well)
    w = training_data.loc[training_data['Well Name'] == well]
    print(w.isnull().values.any())
    print(w.isnull().sum(), '\n')
Simple and quick, the output tells me, for example, that the well ALEXANDER D is missing 466 PE samples, and Recruit F9 is missing 12.
However, the printout is neither easy nor pleasant to read, as it is a long list like this:
SHRIMPLIN
False
Facies 0
Formation 0
Well Name 0
Depth 0
GR 0
ILD_log10 0
DeltaPHI 0
PHIND 0
PE 0
NM_M 0
RELPOS 0
dtype: int64
ALEXANDER D
True
Facies 0
Formation 0
Well Name 0
Depth 0
GR 0
ILD_log10 0
DeltaPHI 0
PHIND 0
PE 466
NM_M 0
RELPOS 0
dtype: int64
Recruit F9
True
Facies 0
Formation 0
Well Name 0
Depth 0
GR 0
ILD_log10 0
DeltaPHI 0
PHIND 0
PE 12
NM_M 0
RELPOS 0
dtype: int64
...
...
Log quality tests
For this last part I will use Agile Scientific’s Welly library. Truth be told, I know ahead of time, from participating in the contest, that the logs in this project are relatively clean and well behaved, apart from the missing values. However, I still wanted to show how to check the quality of some of the curves.
First I need to create aliases, and a dictionary of tests (limited to a few), with:
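The exact aliases and tests are in the notebook; purely as a hypothetical illustration of the format Welly expects (these particular choices are mine for this sketch, and may differ from what I actually used):

import welly.quality as qty

# Hypothetical, minimal example; the actual aliases and tests in the notebook may differ
alias = {'Gamma': ['GR'], 'PhotoElectric': ['PE']}

tests = {
    'Each': [qty.no_flat],          # applied to every curve
    'GR': [qty.all_positive],
    'PE': [qty.all_positive],
}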
And with those I can create new Well objects from scratch, and for each of them add the logs as curves and run the tests; then I can either show an HTML table with the test results, or save to a csv file (or both).
import welly
import pandas as pd
from IPython.display import HTML, display

for well in training_data['Well Name'].unique():
    w = training_data.loc[training_data['Well Name'] == well]
    wl = welly.Well({})
    # add each log as a Curve, skipping the non-log columns
    for log in list(w):
        if log not in ['Depth', 'Facies', 'Formation', 'Well Name', 'NM_M', 'RELPOS']:
            p = {'mnemonic': log}
            wl.data[log] = welly.Curve(w[log], basis=w['Depth'].values, params=p)
    # show test results in an HTML table
    print('\n', '\n')
    print('WELL NAME: ' + well)
    print('CURVE QUALITY SUMMARY')
    html = wl.qc_table_html(tests, alias=alias)
    display(HTML(html, metadata={'title': well}))
    # save results to a csv file
    r = wl.qc_data(tests, alias=alias)
    hp = pd.DataFrame.from_dict(r, orient='index')
    hp.to_csv(str(well) + '_curve_quality_summary.csv', sep=',')
From those I can see that, apart from the issues with the PE log, GR has some high values in SHRIMPLIN, and so on…
All of the above is critical to determine the data imputation strategy, which is the topic of one of the next posts; but first, in the next post, I will use a number of visualizations of the data to examine its distribution by well and by facies, and to explore relationships among variables.
After whetting (hopefully) your appetite with the Machine Learning quiz / teaser, I am now moving on to a series of posts that I decided to title “Geoscience Machine Learning bits and bobs”.
OK, BUT first of all, what does ‘bits and bobs‘ mean? It is a (mostly) British English expression that means “a lot of small things”.
Is it a commonly used expression? If you are curious enough you can read this post about it on the Not One-Off Britishisms blog. Or you can just look at the two Google Ngram plots below: the first is my updated version of the one in the post, comparing the usage of the expression in British vs. US English; the second is a comparison of its British English usage to that of the more familiar “bits and pieces” (not exactly the same, according to the author of the blog, but the Cambridge Dictionary seems to contradict that claim).
I’ve chosen this title because I wanted to distill, in one spot, some of the best collective bits of Machine Learning that came out during, and in the wake of the 2016 SEG Machine Learning contest, including:
The best methods and insights from the submissions, particularly the top 4 teams
Things that I learned myself, during and after the contest
Things that I learned from blog posts and papers published after the contest
I will touch on a lot of topics but I hope that – in spite of the title’s pointing to a random assortment of things – what I will have created in the end is a cohesive blog narrative and a complete, mature Machine Learning pipeline in a Python notebook.
*** September 2020 UPDATE ***
Although I have more limited time these days, compared to 2016, I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.
***************************
Some background on the 2016 ML contest
The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data from the original notebook by Brendon Hall:
The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)
The nine discrete facies (classes of rocks) are:
Tentative topics for this series
List of previous works (in this post)
Data inspection
Data visualization
Data sufficiency
Data imputation
Feature augmentation
Model training and evaluation
Connecting the bits: a full pipeline
List of previous works (comprehensive, to the best of my knowledge)
In each post I will make a point to explicitly reference whether a particular bit (or a bob) comes from a submitted notebook by a team, a previously unpublished notebook of mine, a blog post, or a paper.
However, I’ve also compiled below a list of all the published works, for those that may be interested.
In my last two posts I published part 1 and part 2 of this Machine Learning quiz. If you have not read them, please do (and cast your votes) before you read part 3 below.
QUIZ, part 3: vote responses and (some) answers
In part 1 I asked which predictions looked “better”: those from model A or those from model B (Figure 1)?
Figure 1
As a reminder, both model A and model B were trained to predict the same labeled facies picked by a geologist on core, shown in the left columns (they are identical) of the respective model panels. The right columns in each panel are the predictions.
The question was asked out of context, with no information given about the training process, differences in data manipulation (if any), and/or the model algorithm used. Very unfair, I know! And yet, ~78% of 54 respondents clearly indicated their preference for model A. My sense is that this is because model A looks smoother overall and has fewer of the extra misclassified thin layers.
Response 1
In part 2, I presented the two predictions again, this time accompanied by the confusion matrix for each model (Figure 2).
Figure 2
I asked again which model would be considered better [1] and this was the result:
Response 2a
Although there were far fewer votes (not as robust a statistical sample), I see that the proportion of votes is very similar to that of the previous response, and decidedly in favor of model A again. However, the really interesting, and to me surprising, lesson came from the next answer (Response 2b): about 82% of the 11 respondents believe the performance scores in the confusion matrix to be realistic.
Response 2b
Why was it a surprise? It is now time to reveal the trick…
…which is that the scores in part 2, shown in the confusion matrices of Figure 2, were calculated on the whole well, for training and testing together!!
A few more details:
I used default parameters for both models
I used a single 70/30 train/test split (the same random split for both models) with no cross-validation
which is, in essence, how to NOT do Machine Learning!
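To make the trick concrete, here is a minimal sketch with hypothetical names (X and y stand in for the features and facies labels, and the classifier is just a placeholder, not necessarily model A or B):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# X, y are hypothetical names for the features and facies labels;
# the classifier here is only a placeholder
clf = RandomForestClassifier(random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)

cm_misleading = confusion_matrix(y, clf.predict(X))          # train + test together (the trick)
cm_honest = confusion_matrix(y_test, clf.predict(X_test))    # held-out test set only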
In Figure 3, I added a new column on the right of each prediction showing in red which part of the result is merely memorized, and in black which part is interpreted (noise?). Notice that for this particular well (the random 70/30 split was done on all wells together) the percentages are 72.5% and 27.5%.
I’ve also added the proper confusion matrix for each model, which used only the test set. These are more realistic (and poor) results.
Figure 3
So, going back to that last response: again, with 11 votes I don’t have solid statistics, but with that caveat in mind one might argue that this is a way you could be ‘sold’ unrealistic (as in over-optimistic) ML results.
At least you could sell them by being vague about the details to those not familiar with the task of machine classification of rock facies and its difficulties (see, for example, this paper for a great discussion of the resolution limitations inherent in using logs (machine) as opposed to core (human geologist)).
Acknowledgments
A big thank you goes to Jesper (Way of the Geophysicist) for his encouragement and feedback, and for brainstorming with me on how to deliver this post series.
[1] Notice that, as pointed out in part 2, the model predictions were slightly different from those in part 1 because I had forgotten to set the random seed to be the same in the two pipelines; but not by much, and the overall ‘look’ was very much the same.
In my previous post I posted part 1 (of 3) of a Machine Learning quiz. If you have not read that post, please do, cast your vote, then come back and try part 2 below.
QUIZ, part 2
Just as a quick reminder, the image below shows the rock facies predicted from two models, which I just called A and B. Both were trained to predict the same labeled rock facies, picked by a geologist on core, which are shown in the left columns (they are identical) of the respective model panels. The right columns in each panel are the predictions.
*** Please notice that the models in this figure are (very slightly) different from those in part 1, because I’d forgotten to set the random seed to be the same in the two pipelines (yes, it happens; my apologies). But they are not so different, so I left the image in part 1 unchanged and just updated this one.
Please answer the first question: which model predicts the labeled facies “better” (visually)?
Now study the performance as summarized in the confusion matrices for each model (the purple arrows indicate which model each matrix belongs to; I’ve highlighted in green the columns where each model does better, based on F1, though you don’t have to agree with my choice), and answer the second question (notice that the differences are often just a few hundredths, or even a single one).
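If you want to compute these kinds of summaries on your own predictions, a minimal sketch (model_A, X and y_true are hypothetical names, not my actual pipeline) would be:

from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical names: y_true are the reference facies, model_A a fitted classifier, X the features
y_pred = model_A.predict(X)
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))   # per-class precision, recall and F1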
I’ve been meaning to write about the 2016 SEG Machine Learning Contest for some time. I am thinking of a short and not very structured series (i.e. I’ll jump all over the place) of 2, possibly 3 posts (with the exclusion of this quiz). It will mostly be a revisiting, and extension, of some work that team MandMs (Mark Dahl and I) did, but not necessarily posted. I will most certainly touch on cross-validation, learning curves, data imputation, and maybe a few other topics.
Background on the 2016 ML contest
The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data from the original notebook by Brendon Hall:
The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)
The nine discrete facies (classes of rocks) are:
For some examples of the work during the contest, you can take a look at the original notebook, one of the submissions by my team, where we used Support Vector Classification to predict the facies, or a submission by one of the top 4 teams, all of whom achieved the highest scores on the validation data with different combinations of Boosted Trees trained on augmented features alongside the original features.
QUIZ
Just before last Christmas, I ran a fun little experiment to resume work with this dataset. I decided to turn the outcome into a quiz.
Below I present the predicted rock facies from two distinct models, which I call A and B. Both were trained to predict the same labeled facies picked by the geologist, which are shown in the left columns (they are identical) of the respective model panels. The right columns in each panel are the predictions. Which predictions are “better”?
Please be warned, the question is a trick one. As you can see, I am gently leading you to make a visual, qualitative assessment of “better-ness”, while being absolutely vague about the models and not giving any information about the training process, which is intentional, and – yes! – not very fair. But that’s the whole point of this quiz, which is really a teaser to the series.
Understanding classification with Support Vector Machines
Support Vector Machines are a popular type of algorithm used in classification, which is the process of “…identifying to which of a set of categories (sub-populations) a new observation belongs” (source: Wikipedia).
In classification, the output variable is a category, for example ‘sand’ or ‘shale’, and the main task of the process is the creation of a dividing boundary between the classes. This boundary will be a line in a bi-dimensional space (only two features used to classify), a surface in a three-dimensional space (three features), and a hyper-plane in a higher-dimensional space. In this article I will use the terms hyper-plane, boundary, and decision surface interchangeably.
Defining the boundary may sound like a simple task, especially with two features (a bidimensional scatterplot), but it underlines the important concept of generalization, as pointed out by Jake VanderPlas in his Introduction to Scikit-Learn, because “… in drawing this separating line, we have learned a model which can generalize to new data: if you were to drop a new point onto the plane which is unlabeled, this algorithm could now predict…” the class it belongs to.
Let’s use a toy classification problem to understand in more detail how in practice SVMs achieve the class separation and find the hyperplane. In the figure below I show an idealized version (with far fewer points) of a Vp/Vs ratio versus P-impedance crossplot from Amato del Monte (2017, Seismic rock physics, tutorial on The Leading Edge). I’ve added three possible boundaries (dashed lines) separating the two classes.
Each boundary is valid, but are they equally good? Well, for the SVM classifier, they are not because the classifier looks for the boundary with the largest distance from the nearest point in either of the classes.
These points, called Support Vectors, are the most representative of each class, and typically the most difficult to classify. They are also the only ones that matter; if a Support Vector is moved, the boundary will also move. However, if any other point is moved, provided that it is not moved into the margin or across the boundary, it would have no effect on the boundary. This makes SVM classifiers insensitive to outliers (points very far away from the rest of the points in their class and from the boundary) and also less memory intensive than other classifiers (for example, the perceptron). The process of finding this boundary is referred to as “maximizing the margin”, where the margin is a corridor with no data points between the boundary and the support vectors. The larger this buffer, the lower the generalization error; conversely, small margins are almost invariably associated with over-fitting. We will see more on this in a subsequent section.
So, to go back to the question, which of the three proposed boundaries is the best one (and by “best” I am referring to the one that will generalize better to unseen data)? Based on what we’ve learned so far, it would have to be the green boundary. Indeed, the orange one is so close to its support vectors (the two points circled with orange) that it leaves virtually no margin; the purple boundary is slightly better (the support vectors are the points circled with purple) but its margin is still quite small compared to the green boundary.
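As a minimal, self-contained sketch (with made-up toy values rather than the data from the tutorial), this is how a linear SVM boundary and its support vectors can be obtained with Scikit-learn:

import numpy as np
from sklearn.svm import SVC

# Made-up toy data: P-impedance (first column) and Vp/Vs (second column) for two classes
X = np.array([[5.5, 1.9], [6.0, 2.0], [6.2, 1.8],
              [7.5, 2.4], [8.0, 2.5], [8.2, 2.6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel; a large C approximates the hard-margin, idealized case discussed above
clf = SVC(kernel='linear', C=1000).fit(X, y)

print(clf.support_vectors_)        # the points that define the margin
print(clf.coef_, clf.intercept_)   # w and b of the separating hyper-plane w.x + b = 0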
Maximizing the margin is the goal of the SVM classifier, and it is a constrained optimization problem. I refer interested readers to Hearst (1998, Support Vector Machines, IEEE Intelligent Systems); however, I will quote a definition from that paper (with reference to Figure 1 and accompanying text) as it yields further understanding: “… the optimal hyper-plane is orthogonal to the shortest line connecting the convex hulls of the two classes, and intersects it half way”.
In the inset in the figure, I zoomed closer to the 4 points near the green boundary; I’ve also drawn the convex hulls for the classes, the margin, and the shortest orthogonal line, which is bisected by the hyper-plane. I have selected (by hand) the best hyper-plane already (the green one), but if you can imagine rotating a line to span all possible orientations in the empty space close to the two classes without intersecting either of the hulls and find the one with the largest margin, you’ve just done quadratic optimization in your head. Moreover, you’ve turned a crossplot into a decision surface (quoted from Sebastian Thrun, Intro to Machine Learning, Udacity 120 course).
If you are interested in learning more about Support Vector Machines in an intuitive way, and then in how to try classification in practice (using Python and the Scikit-learn library), read the full article here, check the GitHub repo, then read How good is what? (a blog post by Evan Bianco of Agile Scientific) for an example and DIY evaluation of classifier performance.