Geoscience Machine Learning bits and bobs – data completeness

2016 Machine learning contest – Society of Exploration Geophysicists

In a previous post I showed how to use pandas.isnull to find out, for each well individually, whether a column has any null values, and sum to get how many, for each column. Here is one of the examples (with more modern, pandas-like syntax compared to the example in the previous post):

for well, data in training_data.groupby('Well Name'):
    print(well)
    print(data.isnull().values.any())
    print(data.isnull().sum(), '\n')

Simple and quick, the output showed me that, for example, the well ALEXANDER D is missing 466 samples from the PE log:

ALEXANDER D
True
Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64

A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function from the missingno library. With the code below I can turn each well into a Pandas DataFrame on the fly, then create a missingno matrix plot.

import numpy as np
import matplotlib.pyplot as plt
import missingno as msno

for well, data in training_data.groupby('Well Name'):
    msno.matrix(data, color=(0., 0., 0.45))
    fig = plt.gcf()
    fig.set_size_inches(20, np.round(len(data)/100))  # height of the plot for each well reflects well length
    axes = fig.get_axes()
    axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')

I find that looking at these two plots (the matrix plot above and the dendrogram below) provides a very compelling and informative way to inspect data completeness, and I wonder whether they could be used, together with domain knowledge from petrophysics, to guide the strategy for dealing with missing data.
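As an aside, the dendrogram I discuss next is not produced by the loop above: missingno has a dedicated function for it, which can be called directly on the full dataset:

import missingno as msno

# nullity dendrogram: clusters the logs by how similarly
# their missing values are distributed
msno.dendrogram(training_data)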

Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thought is that it suggests trying to predict missing values sequentially rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from the top right, I am thinking of testing the use of GR to first predict missing values in RDEP, then both of them to predict missing values in RMED, then DTC; then adding CALI and using all logs completed so far to predict RHOB, and so on.
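To make the idea concrete, here is a minimal sketch of the sequential strategy with scikit-learn; the placeholder data, the regressor choice, and the assumption that GR is complete are all mine, for illustration only:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# placeholder log data; in practice this would be the real well logs
rng = np.random.default_rng(0)
logs = pd.DataFrame(rng.normal(size=(500, 4)),
                    columns=['GR', 'RDEP', 'RMED', 'DTC'])
for log in ['RDEP', 'RMED', 'DTC']:
    logs.loc[rng.random(500) < 0.2, log] = np.nan  # knock out some samples

predictors = ['GR']                                # assumed complete
for target in ['RDEP', 'RMED', 'DTC']:
    known = logs[target].notnull()
    model = LinearRegression().fit(logs.loc[known, predictors],
                                   logs.loc[known, target])
    logs.loc[~known, target] = model.predict(logs.loc[~known, predictors])
    predictors.append(target)                      # completed log joins the predictors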

Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I imagine comparing the training and cross-validation error first using only non-NaN rows, then replacing all NaNs with the mean, and finally comparing separately this sequential log-completion strategy and an all-in-one strategy.
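Sketched out with scikit-learn, the comparison might look like this for any one of the missing-data strategies (the feature matrix, labels, and classifier below are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X = np.random.rand(800, 7)        # placeholder feature matrix (7 logs)
y = np.random.randint(0, 9, 800)  # placeholder facies labels

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# error = 1 - accuracy, averaged across the cross-validation folds
plt.plot(sizes, 1 - train_scores.mean(axis=1), label='training error')
plt.plot(sizes, 1 - cv_scores.mean(axis=1), label='cross-validation error')
plt.xlabel('training set size')
plt.legend()
plt.show()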

Computer vision in geoscience: recover seismic data from images, introduction

In a recent post titled Unweaving the rainbow, Matt Hall described our joint attempt (partly successful) to create a Python tool to enable recovery of digital data from any pseudo-colour scientific image (and a seismic section in particular, like the one in Figure 1), without any prior knowledge of the colormap.


Figure 1. Test image: a photo of a distorted seismic section on my wall.

Please check our GitHub repository for the code and slides and watch Matt’s talk (very insightful and very entertaining) from the 2017 Calgary Geoconvention below:

In the next two posts, coming up shortly, I will describe in greater detail my contribution to the project, which focused on developing a computer vision pipeline to automatically detect where the seismic section is located in the image, rectify any distortions that might be present, and remove all sorts of annotations and trivia around and inside the section. The full workflow is included below (with sections I-VI developed to date), and a minimal code sketch of the first two sections follows the list:

  • I – Image preparation, enhancement:
    1. Convert to gray scale
    2. Optional: smooth or blur to remove high frequency noise
    3. Enhance contrast
  • II – Find seismic section:
    1. Convert to binary with adaptive or other threshold method
    2. Find and retain only largest object in binary image
    3. Fill its holes
    4. Apply opening and dilation to remove minutiae (tick marks and labels)
  • III – Define rectification transformation
    1. Detect contour of largest object found in (2). This should be the seismic section.
    2. Approximate contour with polygon with enough tolerance to ensure it has 4 sides only
    3. Sort polygon corners using angle from centroid
    4. Define new rectangular image using the lengths of the longest long side and longest short side of the initial contour
    5. Estimate and output transformation to warp polygon to rectangle
  • IV – Warp using transformation
  • V – Blanking annotations inside seismic section (if rectangular):
    1. Start with output of (4)
    2. Pre-process and apply a Canny filter
    3. Find contours in the Canny filter output that are smaller than an input size
    4. Sort contours (by shape and angular relationships or diagonal lengths)
    5. Loop over contours:
      1. Approximate contour
      2. If the approximation has 4 points AND the 4 semi-diagonals are of the same length: fill contour and add to mask
  • VI – Use mask to remove text inside rectangle in the input and blank (NaN) the whole rectangle. 
  • VII – Optional: tools to remove arrows and circles/ellipses:
    1. For arrows – from the contours in (4), find the ones with 7 sides and low convexity (concave); alternatively, use Harris corner detection and count 7 corners, or template matching
    2. For ellipses – template matching or regionprops
  • VIII – Optional FFT filters to remove timing lines and vertical lines
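Here is that sketch of sections I and II, assuming OpenCV; the input filename, threshold settings, and kernel size are illustrative placeholders, not the values I actually settled on:

import cv2
import numpy as np

img = cv2.imread('seismic_photo.jpg')              # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # I.1 - convert to gray scale
blur = cv2.GaussianBlur(gray, (5, 5), 0)           # I.2 - remove high frequency noise
enh = cv2.equalizeHist(blur)                       # I.3 - enhance contrast

# II.1 - convert to binary with an adaptive threshold
binary = cv2.adaptiveThreshold(enh, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 51, 10)

# II.2 - find and retain only the largest object in the binary image
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # label 0 is the background
mask = np.uint8(labels == largest) * 255

# II.4 - opening and dilation to remove minutiae (II.3, hole filling, omitted)
kernel = np.ones((5, 5), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.dilate(mask, kernel)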

You can download from GitHub all the tools for the automated workflow (parts I-VI) in the module mycarta.py, as well as an example Jupyter Notebook showing how to run it.

The first post focuses on the image pre-processing and enhancement, and the detection of the seismic line (sections I and II, in green); the second one deals with the rectification of the seismic (sections IV to V, in blue). They are not meant as full tutorials, rather as a pictorial road map to (partial) success, but key Python code snippets will be included and discussed.

NASA’s beautiful ‘Planet On Fire’ images and video


Credits: NASA’s Goddard Space Flight Center and NASA Center for Climate Simulation.

 

[Image: still from NASA’s ‘Planet On Fire’ visualization]

Click on the image to watch the original video on NASA’s Visualization Explorer site.

Read the full story on NASA’s Visualization Explorer site.

Mapping and validating geophysical lineaments with Python

In Visualization tips for geoscientists: MATLAB, Part III I showed there’s a qualitative correlation between occurrences of antimony mineralization in the southern Tuscany mining district and the distance from lineaments derived from the total horizontal derivative (also called maximum horizontal gradient).

Let’s take a look at it below (distance from lineaments increases as the color goes from blue to green, yellow, and then red).

[Figure: map of distance from lineaments, with antimony mineral occurrences]

However, in a different map in the same post I showed that lineaments derived using the maxima of the hyperbolic tilt angle (Cooper and Cowan, 2006, Enhancing potential field data using filters based on the local phase) are offset systematically from those derived using the total horizontal derivative.

Let’s take a look at it below: in this case Bouguer gravity values increase as the color goes from blue to green, yellow, and then red; white polygons are basement outcrops.

The lineaments from the total horizontal derivative are in black, those from the maxima of hyperbolic tilt angle are in gray. Which lineaments should be used?

The ideal way to map the location of density contrast edges (as a proxy for geological contacts) would be to use gravity forward models, or even 3D gravity inversion, ideally constrained by all available independent data sources (magnetic or induced-polarization profiles, exploratory drilling data, reflection seismic interpretations, and so on).

The next best approach is to map edges using a number of independent gravity data enhancements, and then only use those that collocate.

Cooper and Cowan (in the same 2006 paper) demonstrate that no single edge-detector method is a perfect geologic-contact mapper. Citing Pilkington and Keating (2004, Contact mapping from gridded magnetic data – A comparison of techniques), they conclude that the best approach is to use “collocated solutions from different methods providing increased confidence in the reliability of a given contact location”.

I show an example of such a workflow in the image below. In the first column from the left is a map of the residual Bouguer gravity from a smaller area of interest in the southern Tuscany mining district (where measurements were made on a denser station grid). In the second column from the left are the lineaments extracted using three different (and independent) derivative-based data enhancements followed by skeletonization. The same lineaments are superimposed on the original data in the third column from the left. Finally, in the last column, the lineaments are combined into a single collocation map to increase confidence in the edge locations (I applied a mask so as to display edges only where at least two methods collocate).

[Figure: from left to right – residual Bouguer gravity; lineaments from three enhancements; lineaments superimposed on the data; combined collocation map]

If you want to learn more about this method, please read my note in the Geophysical tutorial column of The Leading Edge, which is available with open access here.
To run the open source Python code, download the iPython/Jupyter Notebook from GitHub.

With this notebook you will be able to:

1) create a full suite of derivative-based enhanced gravity maps;

2) extract and refine lineaments to map edges;

3) create a collocation map.
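The collocation step itself needs surprisingly little code. Here is a minimal sketch, assuming the lineaments from the three enhancements have already been extracted and skeletonized into binary arrays (the helper name and inputs are mine, for illustration):

import numpy as np

def collocation_map(lineament_maps, min_methods=2):
    """Keep edges only where at least min_methods of the
    input binary lineament maps agree."""
    counts = np.sum(lineament_maps, axis=0)  # number of methods flagging each pixel
    return np.where(counts >= min_methods, counts, np.nan)

# e.g.: combined = collocation_map([edges_thdr, edges_tilt, edges_theta])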

These techniques can easily be adapted to collocate lineaments derived from seismic data, with which the same derivative-based enhancements are showing promising results (Russell and Ribordy, 2014, New edge detection methods for seismic interpretation).

Logarithmic spiral, nautilus, and rainbow

The other day I stumbled into an interesting article on The Guardian online: The medieval bishop who helped to unweave the rainbow. In the article I learned for the first time of Robert Grosseteste, a 13th century British scholar (with an interesting Italian last name: Grosse teste = big heads) who was also the Bishop of Lincoln.

The Bishop’s interests and investigations covered diverse topics, making him a pre-Renaissance polymath; however, it is his 1225 treatise on colour, the De Colore, that is receiving much attention.

In a recent commentary in Nature Physics (All the colours of the rainbow) and the reference therein (A three-dimensional color space from the 13th century), Smithson et al. (who also recently published a new critical edition/translation of the treatise with analysis and critical commentaries) analyze the 3D colorspace devised by Grosseteste, who claimed it allows the generation of all possible colours and describes the variations of colours among different rainbows.

As we learn from Smithson et al., Grosseteste’s colorspace had three dimensions, quantified by physical properties of the incident light and the medium: the scattering angle (which produces variation of hue within a rainbow), the purity of the scattering medium (which produces variation between different rainbows and is linked to the size of the water droplets in the rainbow), and the altitude of the sun (which produces variation in the light incident on a rainbow). The authors were able to model this colorspace and also to show that the locus of rainbow colours generated in it forms a spiral surface (a family of spiral curves, each for a specific rainbow) in the perceptual CIELab colorspace.

I found this not only fascinating – a three-dimensional, perceptual colorspace from the 13th century!! – but also a source of renewed interest in creating the perfect perceptual colormaps by spiralling through CIELab.

My first attempt at colormap spiralling in CIELab, CubicYF, came to life by selecting hand-picked colours on CIELab colour charts at fixed lightness values (found in this document by Gernot Hoffmann). The process was described in this post, and you can see an animation of the spiral curve in CIELab space (created with the 3D color inspector plugin in ImageJ) in the video below:

Some time later, after reading this post by Rob Simmon (in particular the section on the NASA Ames Color Tool), and after an email exchange with Rob, I started tinkering with the idea of creating perceptual rainbow colormaps in CIELab programmatically, by using a helix curve or an Archimedean spiral, but reading Smithson et al. got me to try the logarithmic spiral.

So I started my experiments with a warm-up: replicating a Nautilus using a logarithmic spiral with a growth ratio equal to 0.1759. You may have read that the rate at which a Nautilus shell grows can be described by the golden ratio phi, but in fact the golden spiral constructed from a golden rectangle is not a Nautilus spiral (as an aside, while playing with the code I recalled reading some time ago Golden spiral, a nice blog post (with lots of code) by Cleve Moler, creator of the first version of Matlab, who simulated a golden spiral using a continuously expanding sequence of golden rectangles and inscribed quarter circles).

My nautilus-like spiral, plotted in Figure 1, has a growth ratio of 0.1759 instead of the golden ratio of 1.618.


Figure 1: nautilus-like spiral with growth ratio = 0.1759
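For completeness, here is a minimal Python sketch of such a spiral (my original figure was made in Matlab; this is an equivalent, not the original code):

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 8 * np.pi, 2000)  # four full turns
b = 0.1759                               # growth ratio
r = np.exp(b * theta)                    # logarithmic spiral, r = exp(b*theta)

ax = plt.subplot(projection='polar')
ax.plot(theta, r)
plt.show()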

And here’s the colormap (I called it logspiral) I came up with after a couple of hours of hacking: as hue cycles from 360 to 90 degrees, chroma spirals outwardly (I used a logarithmic spiral with polar equation c1*exp(c2*h) with a growth ratio c2 of 0.3 and a constant c1 of 20), and lightness increases linearly from 30 to 90.
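In Python terms (my original code was Matlab), the construction looks roughly like this; the choice of the hue angle travelled, in radians, as the spiral parameter is my assumption, and out-of-gamut colours are simply clipped by scikit-image:

import numpy as np
from skimage.color import lab2rgb

n = 256
L = np.linspace(30, 90, n)                  # lightness increases linearly
hue = np.deg2rad(np.linspace(360, 90, n))   # hue cycles from 360 to 90 degrees
sweep = np.deg2rad(np.linspace(0, 270, n))  # angle travelled along the spiral
C = 20 * np.exp(0.3 * sweep)                # chroma: c1*exp(c2*h), spiralling out

a, b = C * np.cos(hue), C * np.sin(hue)     # cylindrical LCh -> Cartesian Lab
lab = np.stack([L, a, b], axis=-1)[np.newaxis]
logspiral_rgb = lab2rgb(lab).squeeze()      # out-of-gamut values are clipped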

Figure 2 shows the trajectory in the 2D CIELab a-b plane; the colours shown are the final RGB colours. In Figure 3 the trajectory is shown in 3D CIELab space. The coloured lightness profiles were made using the Colormapline submission from the Matlab File Exchange.


Figure 2: logspiral colormap trajectory in CIELab a-b plane


Figure 3: logspiral colormap in CIELab 3D space

 

N.B. In creating logspiral, I was inspired by Figure 2 in the Nature Physics paper, but there are important differences in terms of colorspace, lightness profile, and perception: I am not certain their polar coordinates are equivalent to Lightness, Chroma, and Hue, although they could be; and, more importantly, the three-dimensional spirals based on Grosseteste’s colorspace go from low lightness at low scattering angles to much higher values at mid scattering angles, and then drop again at high scattering angles (remember that these spirals describe real-world rainbows), whereas lightness in logspiral is strictly monotonically increasing.

In my next post I will share the Matlab code to generate a full set of logspiral colormaps sweeping the hue circle from different starting colours (and end colours), and also the slower-growing logarithmic spirals to make a set of monochromatic colormaps (similar to those in Figure 2 in the Nature Physics paper).

 

New rainbow colormap: sawtooth-shaped lightness profile

Why another rainbow

In the comment section of my last post, Steve Eddins from Mathworks reported that some Matlab users prefer Jet to Parula, the new default perceptual colormap in Matlab, because within certain ranges Jet affords greater contrast, understood as the rate of change in lightness.

My counter-argument to that is that yes, some data may benefit from being displayed using Jet (in terms of contrast, and hence the power to resolve smaller anomalies) because of those areas of very steep rate of change of lightness, like the blue to cyan and yellow to red portions (see Figure 1). But the price one has to pay is that there is an area of very low gradient (a greenish band between cyan and yellow) where there’s nearly no contrast, which would obfuscate subtle anomalies in the data. On top of that there’s no control of where each of those areas are located, so a lot of effort has to go into trying to fit those regions of artificially high contrast to the portion of data of interest.
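You can verify this yourself in a few lines; here is a quick Python sketch (assuming scikit-image for the colour conversion) that extracts and plots the L* profile of Jet:

import numpy as np
import matplotlib.pyplot as plt
from skimage.color import rgb2lab

# sample jet at 256 points and convert to CIELab to get the L* profile
rgb = plt.get_cmap('jet')(np.linspace(0, 1, 256))[:, :3]
L = rgb2lab(rgb[np.newaxis])[0, :, 0]

plt.plot(L)
plt.ylabel('CIELab L*')
plt.show()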


Figure 1

Because of their high lightness, the yellow and cyan artificial edges also cause problems. In his latest blog post Steve uses a test pattern to demonstrate how they make the interpretation of trivial structures more difficult. He also explains why they occur in some locations and not others in the first place. I wonder if the resulting regions of high lightness juxtaposed to regions of low lightness could be chromatic Mach bands.

Additionally, as Steve points out, the low-contrast juxtaposition of dark red and dark blue bands creates the visual illusion of depth (Chromostereopsis) in other positions of the test pattern, creating further confusion.

But I have some good news for the hardcore fans of Jet, and rainbow colormaps in general. I created a rainbow with a sawtooth-shaped lightness profile made up of 5 ramps, each with the same rate of change in lightness, a total lightness change of 60, and alternating negative and positive signs. This is shown in Figure 2, and replaces the lightness profile of a basic 6-color rainbow (magenta-blue-cyan-green-yellow-red) shown in Figure 3.

Figure 2

Figure 3

With this rainbow users have the ability to apply greater contrast to their data to boost small anomalies, but in a more controlled way. The colormap is available with my File Exchange function, Perceptually improved colormaps. Below is the Matlab code I used to generate the new rainbow.

Matlab code

To run this code you will need Colorspace, a free function from Matlab File Exchange, for the color space transformations.

%% basic 6-colour rainbow
% create RGB components
m = [1, 0, 1]; % magenta
b = [0, 0, 1]; % blue
c = [0, 1, 1]; % cyan
g = [0, 1, 0]; % green
y = [1, 1, 0]; % yellow
r = [1, 0, 0]; % red
% concatenate components
rgb = vertcat(m,b,c,g,y,r);
% interpolate to 256 colours
rainbow=interp1(linspace(1, 256, 6),rgb,[1:1:256]);
%% calculate Lab components
% convert from RGB to Lab colour space
% requires this function: Colorspace transformations
% www.mathworks.com/matlabcentral/fileexchange/28790-colorspace-transformations
lab = colorspace('RGB->Lab',rainbow);
%% replace irregular lightness profile with sawtooth-shaped profile
% contrast (magnitude of lightness change) between
% each pair of adjacent colors set to 60
L1 = [90, 30, 90, 30, 90, 30];
% interpolate to 256 lightness values
L1int = interp1(linspace(1, 256, 6),L1,[1:1:256])';
% replace
lab1 = horzcat(L1int,lab(:,2),lab(:,3));
%% new rainbow
% convert back from Lab to RGB colour space
swtth = colorspace('RGB<-Lab',lab1);

Test results

Figures 4, 5, and 6 show the three colormaps used with my Pyramid test surface (notice in Figure 5 that the green-band artifact with this rainbow is even more pronounced than with Jet). I welcome feedback.

Figure 4


Figure 5


Figure 6

Acknowledgements

The coloured lightness profiles were made using the Colormapline submission from the Matlab File Exchange.

 

Visualizing colormap artifacts

In Evaluate and compare colormaps, I have shown how to extract and display the lightness profile of a colormap using Python. I do this routinely with colormaps, but I realize it takes an effort, and not all users may feel comfortable using code to test whether a colormap is perceptual or not.

This got me thinking that there is perhaps a need for a user-friendly, interactive tool to help identify colormap artifacts, and wondering what it would look like.

In a previous post, Comparing color palettes, I plotted the elevation for the South American continent from the Global Land One-km Base Elevation Project using four different color palettes. In Figure 1 below I plot again 3 of those: rainbow, linear lightness rainbow, and grayscale, respectively, from left to right. In maps like these some artifacts are very evident. For example, there’s a classic film-negative effect in the map on the left, where the Guiana Highlands and the Brazilian Highlands, both in blue, seem to stand lower than the Amazon basin, in violet. This is due to the much lower lightness (or, alternatively, intensity) of blue compared to violet.


Figure 1

 

However, other artifacts are more subtle, like the inversion of the highest peaks in the Andes, which are coloured in red, relative to their surroundings, in particular the Altiplano, an endorheic basin that includes Lake Titicaca.

My idea for this tool is simple, and consists of two windows. The first is a basemap window which can display either a demo dataset or user data loaded from an ASCII grid file. In this window the user would interactively select a profile by building a polyline with point-and-click, like the one in Figure 2 in white.


Figure 2

The second window would show the elevation profile with the colour fill assigned based on the colormap, like in Figure 3 at the bottom (with colormap to the right), and with a profile of the corresponding colour intensities (on a scale 1-255) at the top.

In this view it is immediately evident that, for example, the two highest peaks near the center, coloured in red, are relative intensity lows. Another anomaly is the absolute intensity low on the right side, corresponding to the colour blue, where the elevation profile varies smoothly.

Figure 3
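As a proof of concept for that intensity profile, here is a rough Python sketch; the stand-in elevation data and the luma weights used for intensity are my assumptions, not necessarily what the final tool would use:

import numpy as np
import matplotlib.pyplot as plt

# placeholder elevation profile; in the tool this would come from the polyline
z = np.cumsum(np.random.randn(500))
normed = (z - z.min()) / (z.max() - z.min())

rgb = plt.get_cmap('jet')(normed)[:, :3]                  # colours along the profile
intensity = 255 * rgb @ np.array([0.299, 0.587, 0.114])   # luma-style intensity, 0-255

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(intensity)   # intensity profile at the top
ax2.plot(z)           # elevation profile at the bottom
plt.show()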

I created this concept prototype using a combination of Matlab, Python, and Surfer. I welcome suggestions for possible additional features, and would like to hear from folks interested in collaborating on a web app (ideally in Python).

Convert color palettes to python matplotlib colormaps

This is a quick post to show you how to import my perceptual color palettes (or any other color palette) into Python and convert them into Matplotlib colormaps. We will use as an example the CIE Lab linear L* palette, which was my adaptation to Matlab of the luminance-controlled colormap by Kindlmann et al.

Introduction

You can use the code snippets in here or download the iPython notebook from here (*** please see update ***). You will need NumPy in addition to Matplotlib.

***  UPDATE  ***

Fellow Pythonista Matt Hall of Agile Geoscience extended this work (see comment below) to include more flexible import of the data and formatting routines, and code to reverse the colormap. Please use Matt’s expanded iPython notebook.

Preliminaries

First of all, get the color palettes in plain ASCII format from this page. Download the .odt file for the RGB range 0-1 colors, change the file extension to .zip, and unzip it. Next, open the file Linear_L_0-1, which contains comma-separated values, replace the commas with tabs, and save as a .txt file. That’s it, the rest is in Python.

Code snippets – importing data

Let’s bring in numpy and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Now we load the data, getting the RGB triplets from the tab-delimited text file. Values are expected in the 0-1 range, not the 0-255 range.

LinL = np.loadtxt('Linear_L_0-1.txt')

A quick QC of the data:

LinL[:3]

gives us:

array([[ 0.0143,  0.0143,  0.0143],
       [ 0.0404,  0.0125,  0.0325],
       [ 0.0516,  0.0154,  0.0443]])

looking good!

Code snippets – creating colormap

Now we want to go from an array of values for blue in this format:

b1=[0.0000,0.1670,0.3330,0.5000,0.6670,0.8330,1.0000]

to a list that has this format:

[(0.0000,1.0000,1.0000),
(0.1670,1.0000,1.0000),
(0.3330,1.0000,1.0000),
(0.5000,0.0000,0.0000),
(0.6670,0.0000,0.0000),
(0.8330,0.0000,0.0000),
(1.0000,0.0000,0.0000)]

and then to a dictionary entry that has this format:

{
'blue': [
(0.0000,1.0000,1.0000),
(0.1670,1.0000,1.0000),
(0.3330,1.0000,1.0000),
(0.5000,0.0000,0.0000),
(0.6670,0.0000,0.0000),
(0.8330,0.0000,0.0000),
(1.0000,0.0000,0.0000)],
...
...
}

which is the format required to make the colormap using matplotlib.colors. Here’s the code:

b3 = LinL[:, 2]  # value of blue at sample n
b2 = LinL[:, 2]  # same value again: no discontinuities in the colormap
b1 = np.linspace(0, 1, len(b2))  # position of sample n - ranges from 0 to 1

# setting up the columns for the lists
g3 = LinL[:, 1]
g2 = LinL[:, 1]
g1 = np.linspace(0, 1, len(g2))

r3 = LinL[:, 0]
r2 = LinL[:, 0]
r1 = np.linspace(0, 1, len(r2))

# creating the lists
R = list(zip(r1, r2, r3))
G = list(zip(g1, g2, g3))
B = list(zip(b1, b2, b3))

# transposing the lists
RGB = zip(R, G, B)
rgb = list(zip(*RGB))
# print(rgb)

# creating the dictionary
k = ['red', 'green', 'blue']
LinearL = dict(zip(k, rgb))  # makes a dictionary from the two lists

Code snippets – testing colormap

That’s it, with the next few lines we are done:

from matplotlib.colors import LinearSegmentedColormap

my_cmap = LinearSegmentedColormap('my_colormap', LinearL)
plt.pcolor(np.random.rand(10, 10), cmap=my_cmap)
plt.colorbar()

which gives you this result:

[Figure: pcolor plot of random data with the LinearL colormap]

Notice that if the original file hadn’t had 256 samples, and you wanted a 256-sample color palette, you’d use the line below instead:

my_cmap = LinearSegmentedColormap('my_colormap', LinearL, 256)

Have fun!

Related Posts (MyCarta)

The rainbow is dead…long live the rainbow! – Part 5 – CIE Lab linear L* rainbow

Visualize Mt S Helens with Python and a custom color palette

A bunch of links on colors in Python

http://bl.ocks.org/mbostock/5577023

http://matplotlib.org/api/cm_api.html

http://matplotlib.org/api/colors_api.html

http://matplotlib.org/examples/color/colormaps_reference.html

http://matplotlib.org/api/colors_api.html#matplotlib.colors.ListedColormap

http://wiki.scipy.org/Cookbook/Matplotlib/Show_colormaps

http://stackoverflow.com/questions/21094288/convert-list-of-rgb-codes-to-matplotlib-colormap

http://stackoverflow.com/questions/16834861/create-own-colormap-using-matplotlib-and-plot-color-scale

http://stackoverflow.com/questions/11647261/create-a-colormap-with-white-centered-around-zero