If you have not read Geoscience Machine Learning bits and bobs – introduction, please do so first: it covers the objective and outline of this series, and reviews the dataset I will be using, which is from the 2016 SEG Machine Learning contest.

* September 2020 UPDATE *

Although I have less time now than I did in 2016, I am very excited to be participating in the 2020 FORCE Machine Predicted Lithology challenge. Most new work and blog posts will be about this new contest instead of the 2016 one.

***************************

OK, let’s begin!

With each post, I will add a new notebook to the GitHub repo here. The notebook that goes with this post is called 01 – Data inspection.

Data inspection

The first step after loading the dataset is to create a Pandas DataFrame. With the describe method I get a lot of information for free:

Indeed, from the first row in the summary (the count of non-null values) I learn that about 20% of samples in the photoelectric effect column PE are missing.
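As a minimal sketch of that reading, assuming a small made-up DataFrame (the values below are invented for illustration, not taken from the contest data), the count row of describe can be compared with the number of rows to get the missing fraction per column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "GR": [77.5, 78.3, 79.1, 86.1, 74.6],
    "PE": [4.6, np.nan, 3.6, 3.5, 3.4],
})

summary = toy.describe()

# 'count' only tallies non-null entries, so the shortfall is the missing data.
missing_fraction = 1 - summary.loc["count"] / len(toy)
print(missing_fraction)
```

With the real data, `toy` would simply be replaced by `training_data`.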

I can use pandas.isnull to tell me, for each well, whether a column has any null values, and sum to get the number of null values, again for each column.

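for well in training_data['Well Name'].unique():
    print(well)
    # Rows belonging to the current well
    w = training_data.loc[training_data['Well Name'] == well]
    # True if the well has at least one null value in any column
    print(w.isnull().values.any())
    # Number of null values in each column
    print(w.isnull().sum(), '\n')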

Simple and quick, the output tells me, for example, that the well ALEXANDER D is missing 466 PE samples, and Recruit F9 is missing 12.

However, the printout is neither easy nor pleasant to read, as it is a long list like this:

SHRIMPLIN
False
Facies       0
Formation    0
Well Name    0
Depth        0
GR           0
ILD_log10    0
DeltaPHI     0
PHIND        0
PE           0
NM_M         0
RELPOS       0
dtype: int64 

ALEXANDER D
True
Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64 

Recruit F9
True
Facies        0
Formation     0
Well Name     0
Depth         0
GR            0
ILD_log10     0
DeltaPHI      0
PHIND         0
PE           12
NM_M          0
RELPOS        0
dtype: int64
...
...
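A more compact view is possible. This is offered as an alternative sketch, not as what the notebook does: grouping the null mask by well gives one row per well and one column per log. The toy DataFrame, its well names, and its values are assumptions for illustration; with the real data, `toy` would be `training_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; wells and values are invented.
toy = pd.DataFrame({
    "Well Name": ["A", "A", "B", "B", "B"],
    "GR": [77.5, 78.3, 79.1, 86.1, 74.6],
    "PE": [4.6, 4.1, np.nan, np.nan, 3.4],
})

logs = ["GR", "PE"]

# Boolean null mask, summed within each well: one row per well, one column per log.
null_counts = toy[logs].isnull().groupby(toy["Well Name"]).sum()
print(null_counts)
```

The result is a single small table, which is easier to scan than one printed block per well.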

Log quality tests

For this last part I’ll use Agile Scientific's Welly library. Truth be told, I know ahead of time from participating in the contest that the logs in this project are relatively clean and well behaved, apart from the missing values. However, I still wanted to show how to check the quality of some of the curves.

First I need to create aliases, and a dictionary of tests (limited to a few), with:

alias = {
    "Gamma": ["GR"],
    "Resistivity": ['ILD_log10'],
    "Neut-den poro diff": ['DeltaPHI'],
    "Neut-den poro": ['PHIND'],
    "Photoelectric": ['PE'],  
}

import welly.quality as q

tests = {
    'Each': [
        q.no_monotonic,
        q.no_gaps,
    ],
    'Gamma': [
        q.all_positive,
        q.all_below(350),
    ],
    'Resistivity': [
        q.all_positive,
    ],
}
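In the dictionary above, each test is a callable applied to a curve: some are passed as they are, while all_below(350) is called first to build a test with a fixed threshold. The sketch below mimics that pattern with NumPy stand-ins. The function names are hypothetical and this is not Welly's implementation, only an illustration of the idea on invented GR samples.

```python
import numpy as np

def all_positive_sketch(values):
    """Hypothetical stand-in: True if every non-NaN sample is greater than zero."""
    return bool(np.nanmin(values) > 0)

def all_below_sketch(limit):
    """Hypothetical stand-in: build a test that checks all samples are below limit."""
    def test(values):
        return bool(np.nanmax(values) < limit)
    return test

# Invented GR samples, including one value above 350 and one NaN.
gr = np.array([77.5, 120.0, 361.2, np.nan, 64.3])

toy_tests = {"Gamma": [all_positive_sketch, all_below_sketch(350)]}

# Run every test registered for the curve and collect pass/fail flags.
results = [test(gr) for test in toy_tests["Gamma"]]
print(results)
```

Keeping the tests as plain callables means a new check can be added to the dictionary without touching the loop that runs them.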

And with those I can create new Well objects from scratch, and for each of them add the logs as curves and run the tests; then I can either show an HTML table with the test results, or save to a csv file (or both).

import pandas as pd
import welly
from IPython.display import HTML, display

# Columns that are not log curves and should not be quality-checked
not_logs = ['Depth', 'Facies', 'Formation', 'Well Name', 'NM_M', 'RELPOS']

for well in training_data['Well Name'].unique():
    w = training_data.loc[training_data['Well Name'] == well]
    wl = welly.Well({})
    for log in w.columns:
        if log not in not_logs:
            p = {'mnemonic': log}
            wl.data[log] = welly.Curve(w[log], basis=w['Depth'].values, params=p)

    # Show test results in an HTML table
    print('\n', '\n')
    print('WELL NAME: ' + well)
    print('CURVE QUALITY SUMMARY')
    html = wl.qc_table_html(tests, alias=alias)
    display(HTML(html, metadata={'title': well}))

    # Save results to a csv file
    r = wl.qc_data(tests, alias=alias)
    hp = pd.DataFrame.from_dict(r, orient='index')
    hp.to_csv(str(well) + '_curve_quality_summary.csv', sep=',')

From those I can see that, apart from the issues with the PE log, GR has some high values in SHRIMPLIN, and so on…
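One way to cross-check a flag like the high GR values directly in pandas is sketched below on invented values. The 350 threshold is the one used in the Gamma test; the toy wells and readings are assumptions, and with the real data `toy` would be `training_data`.

```python
import pandas as pd

# Toy stand-in for the training data; wells and GR readings are invented.
toy = pd.DataFrame({
    "Well Name": ["A", "A", "B", "B"],
    "GR": [77.5, 361.2, 79.1, 86.1],
})

# Samples that would fail an "all below 350" check, summarized per well.
high_gr = (
    toy.loc[toy["GR"] >= 350]
    .groupby("Well Name")["GR"]
    .agg(["count", "max"])
)
print(high_gr)
```

Only wells with at least one offending sample appear in the result, so an empty table means the check passes everywhere.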

All of the above is critical for deciding on a data imputation strategy, which is the topic of one of the next posts. First, though, in the next post I will use a number of visualizations to examine the distribution of the data by well and by facies, and to explore relationships among variables.

MyCarta

A blog about Geoscience, Visualization, Data Science, AI

Geoscience Machine Learning bits and bobs – data inspection
