#### 2016 Machine learning contest – Society of Exploration Geophysicists

In a previous post I showed how to use `pandas.isnull` to find out, for each well individually, if a column has any null values, and `sum` to get how many, for each column. Here is one of the examples (with more modern, pandaish syntax compared to the example in the previous post):

    for well, data in training_data.groupby('Well Name'):
        print(well)
        print(data.isnull().values.any())
        print(data.isnull().sum(), '\n')

Simple and quick: the output showed me that, for example, the well ALEXANDER D is missing 466 samples from the PE log:

    ALEXANDER D
    True
    Facies         0
    Formation      0
    Well Name      0
    Depth          0
    GR             0
    ILD_log10      0
    DeltaPHI       0
    PHIND          0
    PE           466
    NM_M           0
    RELPOS         0
    dtype: int64
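The same counts can also be gathered into a single table with one row per well, which is easier to scan than a printout per well. The sketch below is a minimal example on a toy stand-in for `training_data`; the well names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for training_data; well names and values are made up
training_data = pd.DataFrame({
    'Well Name': ['WELL A', 'WELL A', 'WELL A', 'WELL B', 'WELL B'],
    'GR': [66.3, 77.2, 79.1, 86.1, 74.6],
    'PE': [np.nan, np.nan, 3.4, 3.6, 3.5],
})

# Flag nulls in the log columns, then count them within each well
null_counts = (
    training_data.drop(columns='Well Name')
    .isnull()
    .groupby(training_data['Well Name'])
    .sum()
)
print(null_counts)
```

Each cell of `null_counts` is the number of missing samples for that well and log.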

A more appealing and versatile alternative, which I discovered after the contest, comes with the `matrix` function from the *missingno* library. With the code below I can turn each well into a Pandas DataFrame on the fly, then into a *missingno* matrix plot.

    import matplotlib.pyplot as plt
    import missingno as msno
    import numpy as np

    for well, data in training_data.groupby('Well Name'):
        msno.matrix(data, color=(0., 0., 0.45))
        fig = plt.gcf()
        # height of the plot for each well reflects well length
        fig.set_size_inches(20, np.round(len(data)/100))
        axes = fig.get_axes()
        axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')

#### 2020 Machine Predicted Lithology – FORCE

Since I am taking part in this year’s FORCE Machine Predicted Lithology challenge, I decided to take the above visualization up a notch. Using `interactive` from ipywidgets and a similar logic (in this case `data['WELL'].unique()`), the tool below lets me browse wells with a `Select` widget and check the completeness of the chosen well’s curves on the fly. You can try the tool in this Jupyter notebook.

In a second Jupyter notebook, on the other hand, I used the missingno `matrix` function to make a quick visual summary plot of the completeness of the entire dataset, log by log (all wells together):

Then, to explore the data completeness in more depth, I also plotted the library’s dendrogram, shown below. As explained in the library’s documentation, *The dendrogram uses a hierarchical clustering algorithm (courtesy of Scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.*
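To make the quoted description more concrete, here is a rough sketch of the underlying idea using SciPy directly: each log becomes a 0/1 vector marking where it is missing, and those vectors are clustered hierarchically. The toy values, the helper name `nullity_linkage`, and the choice of average linkage with SciPy's default distance are my assumptions, so this is not necessarily the library's exact computation.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

def nullity_linkage(df, method='average'):
    """Cluster columns by where they are missing (assumed settings)."""
    # One row per log, one 0/1 entry per sample: 1 where the value is missing
    nullity = df.isnull().astype(int).values.T
    return hierarchy.linkage(nullity, method=method)

# Toy logs: RDEP and RMED are missing in exactly the same sample
toy = pd.DataFrame({
    'GR':   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'RDEP': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    'RMED': [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    'RHOB': [np.nan, np.nan, np.nan, np.nan, 5.0, 6.0],
})

# Each row of Z is one merge: [cluster i, cluster j, distance, cluster size]
Z = nullity_linkage(toy)
print(Z)
```

Logs with identical gaps merge first, at distance zero, and logs with very different gaps join the tree last. With the real dataset loaded as `data`, `msno.dendrogram(data)` draws the corresponding plot.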

I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they couldn’t be used to guide the strategy to deal with missing data, together with domain knowledge from petrophysics.

Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thought is that it may be worth trying to predict missing values in a sequential fashion rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from top right, I am thinking of testing the use of GR to first predict missing values in RDEP, then both to predict missing values in RMED, then DTC. Then add CALI and use all logs completed so far to predict RHOB, and so on.
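A minimal sketch of what such a sequential completion could look like is below. It is only an illustration under strong assumptions: the first log in the order has no gaps, a plain least-squares linear fit stands in for whatever regressor would really be used, and the function name `sequential_impute` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def sequential_impute(logs, order):
    """Fill logs one at a time; each completed log becomes a predictor for the next."""
    out = logs.copy()
    predictors = [order[0]]  # assumption: the first log has no missing values
    for target in order[1:]:
        missing = out[target].isnull().to_numpy()
        if missing.any():
            # Design matrix (intercept + logs completed so far)
            X = np.column_stack([np.ones(len(out)), out[predictors].to_numpy()])
            y = out[target].to_numpy()
            # Linear least squares as a stand-in for a real regressor
            coef, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)
            out.loc[missing, target] = X[missing] @ coef
        predictors.append(target)
    return out

# Toy logs (made-up values): RMED is missing wherever RDEP is, plus one extra gap
logs = pd.DataFrame({
    'GR':   [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'RDEP': [2.0, np.nan, 4.1, 6.2, 6.9, 9.0],
    'RMED': [14.0, np.nan, 38.2, 52.4, np.nan, 78.0],
})
filled = sequential_impute(logs, order=['GR', 'RDEP', 'RMED'])
print(filled)
```

The order passed in would follow the dendrogram reading above (GR, RDEP, RMED, DTC, and so on), and the linear fit could be swapped for any model with a fit-and-predict interface.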

Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and cross-validation error first using only rows without NaNs, then with all NaNs replaced by the mean, and then comparing separately this sequential log-completion strategy with an all-in-one strategy.
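The first two baselines in that comparison are simple to set up in pandas. The sketch below uses a toy frame with made-up values; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy logs with a few gaps (made-up values)
logs = pd.DataFrame({
    'GR':   [10.0, 20.0, 30.0, 40.0],
    'RHOB': [2.2, np.nan, 2.4, 2.6],
    'DTC':  [90.0, 85.0, np.nan, 80.0],
})

# Baseline 1: keep only the rows where every log is present
complete_rows = logs.dropna()

# Baseline 2: replace each gap with the mean of its own column
mean_filled = logs.fillna(logs.mean())

print(len(complete_rows), 'complete rows out of', len(logs))
print(mean_filled)
```

Each variant would then be passed to the same classifier and scored the same way, for example with scikit-learn's `learning_curve`, so that the training and cross-validation curves can be compared side by side.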