# Machine Learning in Geoscience with Scikit-learn. Part 2: inferential statistics and domain knowledge to select features for oil prediction

In the first post of this series I showed how to use Pandas, Seaborn, and Matplotlib to:

• test, clean up, and summarize the data
• start looking for relationships between variables using scatterplots and correlation coefficients

In this second post, I will expand on the latter point by introducing some tests and visualizations that will help highlight the possible criteria for choosing some variables, and dropping others. All in Python.

I will use a different dataset from the one in the previous post. This one is from the paper “Many correlation coefficients, null hypotheses, and high value” (Lee Hunt, CSEG Recorder, December 2013).

The target to be predicted is oil production from a marine barrier sand. We have measured production (in tens of barrels per day) and 7 predictors, whose identities are initially unknown, at 21 wells.

Hang on tight, and read along, because it will be a wild ride!

I will show how to:

1) automatically flag linearly correlated predictors, so we can decide which might be dropped. In the example below (a matrix of pair-wise correlation coefficients between variables) we see that X2 and X7, the second- and third-best individual predictors of production (shown in the bottom row), are also highly correlated with X1, the best overall predictor.

2) automatically flag predictors that fail a critical r test

3) create a table to assess the probability that a certain correlation is spurious, in other words the probability of getting a correlation coefficient at least as high as the one we got with our sample purely by chance.
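To make the three steps concrete, here is a minimal sketch using Pandas and SciPy. It is not the code from the notebook: the data are synthetic stand-ins (not Hunt's dataset), the function names `flag_collinear`, `critical_r`, and `correlation_table` are hypothetical, and the 0.8 collinearity threshold and 0.05 significance level are assumptions chosen only for illustration. The critical r comes from the two-tailed Student's t value with n - 2 degrees of freedom, and the probability of a chance correlation is taken to be the two-sided p-value returned by `scipy.stats.pearsonr`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the Hunt dataset): 21 wells, 3 predictors.
rng = np.random.default_rng(0)
n_wells = 21
x1 = rng.normal(size=n_wells)
x2 = x1 + 0.2 * rng.normal(size=n_wells)   # deliberately collinear with X1
x3 = rng.normal(size=n_wells)              # unrelated noise
y = 3.0 * x1 + rng.normal(size=n_wells)    # "production", driven by X1
data = pd.DataFrame({"X1": x1, "X2": x2, "X3": x3, "Y": y})


def flag_collinear(df, target, threshold=0.8):
    """Step 1: list predictor pairs whose |r| exceeds the (assumed) threshold."""
    corr = df.drop(columns=target).corr()
    cols = list(corr.columns)
    return [(a, b, corr.loc[a, b])
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if abs(corr.loc[a, b]) > threshold]


def critical_r(n, alpha=0.05):
    """Step 2: two-tailed critical Pearson r for n samples at level alpha."""
    dof = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, dof)
    return t_crit / np.sqrt(dof + t_crit ** 2)


def correlation_table(df, target, alpha=0.05):
    """Step 3: r, chance probability, and critical-r flag for each predictor."""
    r_crit = critical_r(len(df), alpha)
    rows = []
    for col in df.columns.drop(target):
        r, p = stats.pearsonr(df[col], df[target])
        rows.append({"predictor": col, "r": r, "p_value": p,
                     "passes_critical_r": abs(r) > r_crit})
    return pd.DataFrame(rows).set_index("predictor")


print(flag_collinear(data, "Y"))
print(round(critical_r(n_wells), 3))   # 0.433 for 21 wells at alpha = 0.05
print(correlation_table(data, "Y"))
```

With only 21 wells the critical r is fairly high (about 0.43 at the 5% level), which is one reason a correlation that looks decent on a scatterplot may still fail the test.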

I do not recommend running these tests and applying the criteria blindly. Rather, I will suggest how to use them to learn more about the data and, in conjunction with domain knowledge about the problem at hand (in this case oil production), make more informed choices about which variables should and which should not be used.

And, of course, I will show how to make the prediction.

Have fun reading: get the Jupyter notebook on GitHub.

# Compare lists from text files using Matlab – an application for resource exploration

#### INTRODUCTION

With today’s post I would like to share a Matlab script I have often used to compare lists of ID numbers stored in separate text files. The IDs can refer to anything: oil and gas wells, mining diamond drill hole locations, gravity or resistivity measurement stations, outcrop locations, you name it. And the script will handle any combination of ASCII characters (numbers, letters, etc.).

I have included some test files below for you to try with the code, which is in the next section. Please refer to the comment section in the code for usage and file description. Have fun, and if you try it on your lists, let me know how it works for you.

#### TEST FILES

These files are in doc format; they need to be downloaded and saved as plain txt files. Test1 and Test2 contain 2 short lists of (fake) Canadian Oil and Gas wells; the former is the result of a search using a criterion (wells inside a polygonal area), the latter is a list of wells that have digital wireline logs. Test3 and Test4 are short lists of (fake) diamond drill holes.

*** Notice that the script is set up to compare the list in Test2.txt against the list in Test1.txt, and not the other way around. If using this on your own lists, you will have to decide in advance which subset of wells you are interested in.

Test1

Test2

Test3

Test4
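For readers without Matlab, the comparison can be roughly sketched in Python. This is not a port of the Matlab script: it assumes one ID per line in each file, and it assumes that comparing Test2 against Test1 means reporting which Test2 entries do and do not appear in Test1. The function names `read_ids` and `compare_lists` and the IDs in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_ids(path):
    """Read one ID per line, skipping blank lines and trimming whitespace."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def compare_lists(reference_path, candidate_path):
    """Split the candidate IDs into those found / not found in the reference."""
    reference = set(read_ids(reference_path))
    candidate = read_ids(candidate_path)
    matched = [i for i in candidate if i in reference]
    unmatched = [i for i in candidate if i not in reference]
    return matched, unmatched


# Demo with made-up IDs written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    test1 = Path(tmp) / "Test1.txt"   # e.g. wells inside the search polygon
    test2 = Path(tmp) / "Test2.txt"   # e.g. wells with digital logs
    test1.write_text("WELL-A\nWELL-B\nWELL-C\n")
    test2.write_text("WELL-B\nWELL-D\n")
    matched, unmatched = compare_lists(test1, test2)

print(matched)    # ['WELL-B']
print(unmatched)  # ['WELL-D']
```

Because IDs are treated as plain strings, any combination of characters works, but two IDs must match exactly (including case) to count as the same.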

#### THE SCRIPT

Here is the code: