Visual data exploration in Python – correlation, confidence, spuriousness

This weekend – it was long overdue – I cleaned up a notebook I prepared last year to accompany the talk “Data science tools for petroleum exploration and production”, which I gave with my friend Thomas Speidel at the Calgary 2018 CSEG/CSPG Geoconvention.

The Python notebook can be downloaded from GitHub as part of a full repository, which includes R code from Thomas, or run interactively with Binder.

In this post I want to highlight one aspect in particular: doing data exploration not only visually, but also quantitatively, with inferential statistical tests.

Like most data scientists (professional, or in the making, and I ‘park’ myself for now in the latter bin), after data cleanup a scatter matrix is often the first thing I produce to look for obvious pairwise relationships and trends between variables. And I really love the flexibility and looks of seaborn‘s PairGrid.

In the example below I again use data from Lee Hunt, Many correlation coefficients, null hypotheses, and high value, CSEG Recorder, December 2013; my followers will be familiar with this small toy dataset from reading Geoscience ML notebook 2 and Geoscience ML notebook 3.

The scatter matrix in Figure 1 includes bivariate scatterplots in the upper triangle, bivariate density contours in the lower triangle, and the shape of each variable’s distribution (kernel density estimate) on the diagonal.

import matplotlib.pyplot
import seaborn

matplotlib.pyplot.rcParams["axes.labelsize"] = 18
g = seaborn.PairGrid(data, diag_sharey=False)   # data: DataFrame with the 21-well dataset loaded earlier
axes = g.axes
g.map_upper(matplotlib.pyplot.scatter, linewidths=1,
            edgecolor="w", s=90, alpha=0.5)
g.map_diag(seaborn.kdeplot, lw=4, legend=False)
g.map_lower(seaborn.kdeplot, cmap="Blues_d")
matplotlib.pyplot.show()

Figure 1. Scatter matrix from PairGrid.

It looks terrific. But, as I said, I’d like to show how to add some useful annotations to the individual scatterplots. These are the ones I typically focus on:

  • the Spearman correlation coefficient (CC)
  • its confidence interval (CI)
  • the probability of spurious correlation (PS)

For the Spearman correlation coefficient I use scipy.stats.spearmanr, whereas for the confidence interval and the probability of spurious correlation I use my own functions, which I include below (following, respectively, Stan Brown’s  Stats without tears and Cynthia Kalkomey’s Potential risks when using seismic attributes as predictors of reservoir properties):

import numpy
import scipy.stats

def confInt(r, nwells):
    # 95% confidence interval for a correlation coefficient r estimated from nwells observations,
    # using the Fisher z-transform (after Stan Brown's Stats without tears)
    z_crit = scipy.stats.norm.ppf(.975)    # critical value for a two-sided 95% interval
    std_Z = 1/numpy.sqrt(nwells-3)         # standard error in z-space
    E = z_crit*std_Z
    Z_star = 0.5*(numpy.log((1+r)/(1.0000000000001-r)))  # Fisher transform (tiny offset avoids division by zero at r=1)
    ZCI_l = Z_star - E
    ZCI_u = Z_star + E
    RCI_l = (numpy.exp(2*ZCI_l)-1)/(numpy.exp(2*ZCI_l)+1)  # back-transform to r-space
    RCI_u = (numpy.exp(2*ZCI_u)-1)/(numpy.exp(2*ZCI_u)+1)
    return RCI_u, RCI_l

def P_spurious(r, nwells, nattributes):
    # probability of at least one spurious correlation when nattributes attributes
    # are tested against nwells observations (after Kalkomey)
    t_of_r = r * numpy.sqrt((nwells-2)/(1-numpy.power(r,2)))  # t-statistic for r
    p = scipy.stats.t.sf(numpy.abs(t_of_r), nwells-2)*2       # two-sided p-value
    ks = numpy.arange(1, nattributes+1, 1)
    return numpy.sum(p * numpy.power(1-p, ks-1))
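
For reference, these are the relationships the two functions implement (my shorthand, with n the number of wells and K the number of attributes tested): confInt builds a 95% confidence interval in Fisher z-space and back-transforms it to r-space, and P_spurious follows Kalkomey’s expression for the probability of observing at least one spurious correlation among K tested attributes:

$$Z^{*}=\tfrac{1}{2}\ln\frac{1+r}{1-r},\qquad Z^{*}\pm\frac{z_{0.975}}{\sqrt{n-3}},\qquad r=\frac{e^{2Z}-1}{e^{2Z}+1}$$

$$t=r\sqrt{\frac{n-2}{1-r^{2}}},\qquad P_{\mathrm{spurious}}=\sum_{k=1}^{K}p\,(1-p)^{k-1}$$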

I also need a function to calculate the critical r, that is, the value of the correlation coefficient above which one can rule out chance as an explanation for a relationship:

def r_crit(nwells, a):
    t = scipy.stats.t.isf(a, nwells-2) 
    r_crit = t/numpy.sqrt((nwells-2)+ numpy.power(t,2))
    return r_crit

where a is equal to alpha/2, alpha being the level of significance (the chance of being wrong that one accepts to live with), and nwells-2 is the number of degrees of freedom.

Finally, I need a utility function (adapted from this Stack Overflow answer) to calculate these statistics on the fly and annotate the individual scatterplots with the CC, CI, and PS.

def corrfunc(x, y, rc=rc, **kws):
    # calculate and annotate the Spearman CC, its CI, and the associated p-value (annotated as PS);
    # note: rc (the critical r, calculated further below) must already exist when this cell is run
    r, p = scipy.stats.spearmanr(x, y)
    u, l = confInt(r, 21)               # 21 wells in this dataset
    if r > rc:
        rclr = 'g'
    else:
        rclr = 'm'
    if p > 0.05:
        pclr = 'm'
    else:
        pclr = 'g'
    ax = matplotlib.pyplot.gca()
    ax.annotate("CC = {:.2f}".format(r), xy=(.1, 1.25),
                xycoords=ax.transAxes, color=rclr, fontsize=14)
    ax.annotate("CI = [{:.2f} {:.2f}]".format(u, l), xy=(.1, 1.1),
                xycoords=ax.transAxes, color=rclr, fontsize=14)
    ax.annotate("PS = {:.3f}".format(p), xy=(.1, .95),
                xycoords=ax.transAxes, color=pclr, fontsize=14)

So, now I just calculate the critical r for the number of observations, 21 wells in this case:

rc = r_crit(21, 0.025)

and wrap it all up this way:

matplotlib.pyplot.rcParams["axes.labelsize"] = 18
g = seaborn.PairGrid(data, diag_sharey=False)
axes = g.axes
g.map_upper(matplotlib.pyplot.scatter,  linewidths=1, 
            edgecolor="w", s=90, alpha = 0.5)
g.map_upper(corrfunc)
g.map_diag(seaborn.kdeplot, lw = 4, legend=False)
g.map_lower(seaborn.kdeplot, cmap="Blues_d")
matplotlib.pyplot.show()

so that:

  • the correlation coefficient is coloured green if it is larger than the critical r, otherwise purple
  • the confidence interval is coloured green if both the lower and upper bounds are larger than the critical r, otherwise purple
  • the probability of spurious correlation is coloured green when below 0.05 (a 5% chance), otherwise purple

Figure 2. Enhanced scatter matrix.

Working with 3D seismic data in Python using segyio and numpy (mostly)

Between 2 and 3 years ago I started turning my long-time passion for image processing, and particularly morphological image processing, to the task of fault segmentation.

At the time I shared my preliminary code, with which I was very happy, in a Jupyter notebook, which you can run interactively at this GitHub repository.

Two areas need improvement to get that initial workflow closer to a production one. The first is on the image processing and morphology side; I am thinking of including: a better way to clean up very short faults; pruning to eliminate spurious segments in the skeletonization result; the von Mises distribution instead of the standard distribution to filter low dip angles. I know how to improve all those aspects, and have some code snippets sitting around in various locations on my Mac, but I am not quite ready to push for it.

The second area, on the seismic side, is the ability to work with 3D data. This has been a sore spot for some time.

Enter segyio, a fast, open-source library, developed precisely to work with SEGY files. To be fair, segyio has been around for some time, as I very well know from being a member of the Software Underground community (swung), but it was only a month or so ago that I started tinkering with it.

This post is mostly to share back with the community what I’ve learned in my very first playground session (with some very helpful tips from Jørgen Kvalsvik, a fellow member of swung, and one of the creators of segyio), which allowed me to create a 3D fault segmentation volume (and have lots of fun in the process) from a similarity (or discontinuity) volume.

The workflow, which you can run interactively at this segyio-notebooks GitHub repository (look for the 01 – Basic tutorial) is summarized pictorially in Figure 1, and comprises the steps below:

  • use segyio to import two seismic volumes in SEGY file format from the F3 dataset, offshore Netherlands, licensed CC-BY-SA: a similarity volume and an amplitude volume (with dip-steered median filter smoothing applied); a minimal read sketch follows this list
  • manipulate the similarity to create a discontinuity/fault volume
  • create a fault mask and display a couple of amplitude time slices with superimposed faults
  • write the fault volume to SEGY file using segyio, re-using the headers from the input file
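
Just to give a flavour of the first step, here is a minimal read sketch (not the notebook’s exact code; the file name is hypothetical): segyio.tools.cube loads a regular post-stack volume straight into a NumPy array.

import segyio

# open a post-stack SEG-Y volume and read it into a numpy array of shape (ilines, xlines, samples)
with segyio.open('similarity.sgy', iline=189, xline=193) as f:
    similarity = segyio.tools.cube(f)

print(similarity.shape)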

Figure 1. Naive 3D seismic fault segmentation workflow in Python.


Get the notebooks on GitHub (look for the 01 – Basic tutorial)

Feedback is welcome.


DISCLAIMER: The steps outlined in the tutorial are not intended as a production-quality fault segmentation workflow. They work reasonably well on the small, clean similarity volume, artfully selected for the occasion, but it is just a simple example.

Upscaling geophysical logs with Python using Pandas and Bruges

With a few hours of work last weekend, I finished putting together a Jupyter notebook tutorial, started at the Geophysics Python sprint 2018, demonstrating how to:

    • Use Agile Scientific’s Welly to load two wells with several geophysical logs (a minimal loading sketch follows this list)
    • Use Pandas, Welly, and NumPy to: remove all logs except for compressional wave velocity (Vp), shear wave velocity (Vs), and density (RHOB); store the wells in individual DataFrames; make the sampling rate common to both wells; check for null values; convert units from imperial to metric; convert slowness to velocity; add a well name column
    • Split the DataFrame by well using unique values in the well name column
    • For each group/well use Agile Scientific’s Bruges Backus average to upscale all curves individually
    • Add the upscaled curves back to the DataFrame
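
Just to give a flavour of the loading step, here is a minimal sketch with two hypothetical LAS file names (the notebook loads the actual wells from the repository):

from welly import Well

# load two wells from LAS files and list the available curves (e.g. DT, DTS, RHOB)
w1 = Well.from_las('well_1.las')
w2 = Well.from_las('well_2.las')
print(list(w1.data.keys()))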

Matt Hall (the organizer) told me during a breakfast chat on the first day of the sprint that this tutorial would be a very good one to have, since it is one of the most requested examples by neophyte users of the Bruges library; I was happy to oblige.

The code for the most important bit, the last two items in the above list, is included below:

import numpy as np
import pandas as pd
import bruges as br

# Define parameters for the Backus filter
lb = 40   # Backus length in meters
dz = 1.0  # Log sampling interval in meters

# Do the upscaling work
wells_bk = pd.DataFrame()
grouped = wells['well'].unique()   # unique well names in the input DataFrame
for well in grouped:
    new_df = pd.DataFrame()
    Vp = np.array(wells.loc[wells['well'] == well, 'Vp'])
    Vs = np.array(wells.loc[wells['well'] == well, 'Vs'])
    rhob = np.array(wells.loc[wells['well'] == well, 'RHOB'])
    Vp_bks, Vs_bks, rhob_bks = br.rockphysics.backus(Vp, Vs, rhob, lb, dz)
    new_df['Vp_bk'] = Vp_bks
    new_df['Vs_bk'] = Vs_bks
    new_df['rhob_bk'] = rhob_bks
    wells_bk = pd.concat([wells_bk, new_df])

# Add the upscaled curves back to the input DataFrame
wells_final = np.concatenate((wells.values, wells_bk.values), axis=1)
cols = list(wells) + list(wells_bk)
wells_final_df = pd.DataFrame(wells_final, columns=cols)

And here is a plot comparing the raw and upscaled Vp and Vs logs for one of the wells:

Please check the notebook if you want to try the full example.

How to fix rainbows and other bad colormaps using Python

Yep, colormaps again!

In my 2014 tutorial on The Leading Edge I showed how to Evaluate and compare colormaps (Jupyter notebook here). The article followed an extended series of posts (The rainbow is dead…long live the rainbow!) and then some more articles on rainbow-like colormap artifacts (for example here and here).

Last year, in a post titled Unweaving the rainbow, Matt Hall described our joint attempt to make a Python tool for recovering digital data from scientific images (and seismic sections in particular), without any prior knowledge of the colormap. Please check our GitHub repository for the code and slides, and watch Matt’s talk (very insightful and very entertaining) from the 2017 Calgary Geoconvention below:

One way to use the app is to get an image with unknown, possibly awful colormap, get the data, and re-plot it with a good one.

Matt followed up on colormaps with a more recent post titled No more rainbows! where he relentlessly demonstrates the superiority of perceptual colormaps for subsurface data. Check his wonderful Jupyter notebook.

So it might come as a surprise to some, but this post is a lifesaver for those that really do like rainbow-like colormaps. I discuss a Python method to equalize colormaps so as to render them perceptual.  The method is based in part on ideas from Peter Kovesi’s must-read paper – Good Colour Maps: How to Design Them – and the Matlab function equalisecolormap, and in part on ideas from some old experiments of mine, described here, and a Matlab prototype code (more details in the notebook for this post).

Let’s get started. Below is a time structure map for a horizon in the Penobscot 3D survey (offshore Nova Scotia, licensed CC-BY-SA by dGB Earth Sciences and The Government of Nova Scotia). Can you clearly identify the discontinuities in the southern portion of the map? No?

OK, let me help you. Below I am showing the map resulting from running a Sobel filter on the horizon.

This is much better, right? But the truth is that the discontinuities are right there in the original data; some, however, are very hard to see because of the colormap used (nipy_spectral, one of the many Matplotlib colormaps), which introduces perceptual artifacts, most notably in the green-to-cyan portion.

In the figure below, in the first panel (from the top) I show a plot of the colormap’s Lightness value (obtained by converting a 256-sample nipy_spectral colormap from RGB to Lab) for each sample; the line is coloured by the original RGB colour. This erratic Lightness profile highlights the issue with this colormap: the curve gradient changes magnitude several times, indicating a nonuniform perceptual distance between samples.
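
To make this concrete, here is a minimal sketch (not the notebook’s exact code) of how such a Lightness profile can be computed, using scikit-image for the RGB-to-Lab conversion:

import numpy as np
import matplotlib.pyplot as plt
from skimage import color

# sample the colormap, convert RGB to Lab, and keep the L* (Lightness) channel
cmap = plt.get_cmap('nipy_spectral')
rgb = cmap(np.linspace(0, 1, 256))[:, :3]                  # 256 RGB samples, alpha dropped
lab = color.rgb2lab(rgb.reshape(1, -1, 3)).reshape(-1, 3)
L = lab[:, 0]

plt.plot(L, color='k')
plt.xlabel('colormap sample')
plt.ylabel('L* (Lightness)')
plt.show()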

In the second panel, I show a plot of the cumulative sample-to-sample Lightness contrast differences, again coloured by the original RGB colours in the colormap. This is the best plot to look at because flat spots in the cumulative curve correspond to perceptual flat spots in the map, which is where the discontinuities become hard to see. Notice how the green-to-cyan portion of this curve is virtually horizontal!

That’s it, it is simply a matter of very low, artificially induced perceptual contrast.

Solutions to this problem: the obvious one is to NOT use this type of colormap (you can learn much about which colormaps are perceptually good, and which are not, here); a possible alternative is to fix them. This can be done by re-sampling the cumulative curve so as to give it constant slope (or constant perceptual contrast). The irregularly spaced dots at the bottom of the second panel show the re-sampling locations, which are much farther apart in the perceptually flat areas and much closer together where the curve is steeper.
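
Continuing the sketch above (it reuses L and cmap), a naive version of this re-sampling could look like the following; the notebook implements a more careful version of the same idea:

from matplotlib.colors import ListedColormap

dL = np.abs(np.diff(L))                                          # sample-to-sample Lightness contrast
cum = np.insert(np.cumsum(dL), 0, 0)                             # cumulative contrast curve
target = np.linspace(0, cum[-1], len(L))                         # constant-slope target curve
positions = np.interp(target, cum, np.linspace(0, 1, len(L)))    # re-sampling locations
cmap_eq = ListedColormap(cmap(positions), name='nipy_spectral_eq')   # equalized colormap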

The third panel shows the resulting constant (and regularly sampled) cumulative Lightness contrast differences, and the fourth and last panel the final Lightness profile, which is now composed of segments with equal Lightness gradient (in absolute value).

Here is the structure map for the Penobscot horizon using the nipy_spectral colormap before and after equalization, one on top of the other to facilitate comparison. I think this method works rather well, and it will allow hard-core aficionados continued use of their favourite rainbow and rainbow-like colormaps.


If you want the code to try the equalization, get the notebook on GitHub.

Using Python to calculate northern hemisphere’s surface land coverage

Yesterday during my lunch break I was rather bored; it is unseasonably cold for the fall, even in Calgary, and a bit foggy too.
For something to do I browsed the Earth Science beta on Stack Exchange looking for interesting questions (as an aside, I encourage readers to look at the unanswered questions).
There was one that piqued my curiosity, “In the northern hemisphere only, what percentage of the surface is land?”.
It occurred to me that I could put together an answer using an equal area projection map and a few lines of Python code; and indeed in 15 minutes I whipped up this workflow:

  • Invert and import this B/W image of equal area projection (Peters) for the Northern hemisphere (land = white pixels).

Source of original image (full globe): Wikimedia Commons

  • Store the image as a Numpy array.
  • Calculate the total number of pixels in the image array (black + white).
  • Calculate the total number of white pixels (1s) by summing the entire array. Black pixels (0s) will not contribute.
  • Calculate percentage of white pixels.

The result I got is 40.44%. Here’s the code:

# import libraries
import numpy as np
from skimage import io
from matplotlib import pyplot as plt

# import image
url = 'https://mycartablog.com/wp-content/uploads/2018/09/peters_projection_black_north.png'
north_equal_area = io.imread(url, as_grey=True)

# check the image
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1, 1, 1)
ax.set_xticks([])
ax.set_yticks([])
plt.imshow(north_equal_area, cmap = 'gray');

# Do the calculations
r, c = np.shape(north_equal_area)
sz =  r*c
s = np.sum(north_equal_area)
print(np.round(s/sz*100, decimals=2))
>>> 40.44

As suggested in a comment to my initial answer, I ran the same Python script for the entire globe and got the expected ~30% land coverage:

# import image 
url = 'https://mycartablog.com/wp-content/uploads/2018/09/peters_projection_black_full.png'
equal_area = io.imread(url, as_grey=True)

# Do the calculations 
r1, c1= np.shape(equal_area)
sz1 =  r1*c1
s1 = np.sum(equal_area)
print(np.round(s1/sz1*100, decimals=2))
>>> 30.08


Adding normal and reverse faults to an impedance model with Python

This is a short and sweet follow-up to yesterday’s post Making many synthetic seismic models in Python, in which I want to show how to add oblique faults to an impedance model, as opposed to a vertical one.

In the pseudo-code below I first select one of the impedance models from the many I created, then make two new versions: one shifted up 5 samples, one shifted down 5 samples. The impedance values for the extra volume added – at the bottom in the first case, the top in the second – are derived from the unshifted impedance model.

Next, I make two copies of the original, call them normal and reverse, and then replace the upper triangle with the upper triangle of the shifted down and shifted up versions, respectively.

# copy, numpy (as np), and scipy.ndimage (as sp.ndimage) are imported earlier in the notebook
unshifted = impedances[n]        # pick one of the impedance models made previously
up = sp.ndimage.interpolation.shift(unshifted, (5,0), cval=unshifted[0,0] * 0.9)
down = sp.ndimage.interpolation.shift(unshifted, (-5,0), cval=unshifted[0,-1] * 0.8)
normal = copy.deepcopy(unshifted)
reverse = copy.deepcopy(unshifted)
sz = len(unshifted)-1
normal[np.triu_indices(sz)] = up[np.triu_indices(sz)]
reverse[np.triu_indices(sz)] = down[np.triu_indices(sz)]

Done!
The results are shown in the figure below.

Left: unfaulted impedance model; center: model with normal fault; right: model with reverse fault.

Making many synthetic seismic models in Python

In this short post I show how to adapt Agile Scientific‘s Python tutorial x lines of code, Wedge model to make 100 synthetic models in one shot: 10 impedance models times 10 wavelets, each with its own random noise field (and a vertical fault).

You can download the notebook with the full code from GitHub. N.B.: the code is optimized for Python 2.7.

I begin by making a 6-layer model, shown in Figure 1:

import numpy as np

model = np.zeros((50,49), dtype=np.int)   # 50 samples x 49 traces
model[8:16,:] = 1
model[16:24,:] = 2
model[24:32,:] = 3
model[32:40,:] = 4
model[40:,:] = 5

Figure 1. Initial 6-layer model

next I make some Vp-rho pairs (rock 0, rock 1, …, rock 5):

rocks = np.array([[2700, 2750],  # Vp, rho
                  [2400, 2450],
                  [2600, 2650],
                  [2400, 2450],
                  [2800, 3000],
                  [3100, 3200]])

and then create 10 slightly different variations of the Vp-rho pairs, one of which is shown in Figure 2:

rnd = np.random.rand(10,6,2)*0.2          # random perturbations of up to 20%
manyrocks = np.array([rocks + rocks*rn for rn in rnd], dtype=np.int)
earth = manyrocks[model]

Figure 2. A Vp-rho pair (earth model)

at which point I can combine Vp-rho pairs to make 10 impedance models, then insert a vertical fault with:

impedances = [np.apply_along_axis(np.product, -1, e).astype(float) for e in earth]  # Python 2
faulted = copy.deepcopy(impedances)
for r, i in zip(faulted, np.arange(len(faulted))):
    temp = np.array(r)
    rolled = np.roll(np.array(r[:,:24]), 4, axis=0)   # shift the first 24 traces down by 4 samples
    temp[:,:24] = rolled
    faulted[i] = temp

Figure 3. Four faulted impedance models.

next I calculate reflection coefficients (Figure 4) and convolve them with a list of 10 Ricker wavelets (generated using Agile’s Bruges) to make synthetic seismic models, shown in Figure 5.
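
For clarity, the first line in the code below computes the familiar normal-incidence reflection coefficient from the acoustic impedance Z at consecutive samples:

$$RC_{i}=\frac{Z_{i+1}-Z_{i}}{Z_{i+1}+Z_{i}}$$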

rc =  [(flt[1:,:] - flt[:-1,:]) / (flt[1:,:] + flt[:-1,:]) for flt in faulted]
ws = [bruges.filters.ricker(duration=0.098, dt=0.002, f=fr) 
      for fr in [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]]
synth = np.array([np.apply_along_axis(lambda t: np.convolve(t, w, mode='same'), axis=0,
      arr=r) for r in rc for w in ws ])

Figure 4. Four reflection coefficients series.

Figure 5. Four synthetic seismic models with vertical fault.

The last bit is the addition of noise, with the result shown in Figure 6:

blurred = sp.ndimage.gaussian_filter(synth, sigma=1.1)
noisy = blurred + 0.5 * blurred.std() * np.random.random(blurred.shape)

Figure 6. Four synthetic seismic models with vertical fault and noise.

Done!

The notebook with the full code is on GitHub, let me know if you find this useful or if you come up with more modeling ideas.


UPDATE, April 22, 2019.

I received an email from a reader, Will, asking some clarification about this article and the making of many impedance models.  I’m re-posting here what I wrote back.

I think the key to understanding this is how we multiply velocity by density in each of the possible earth models.

Looking at the notebook, the earth model array has shape:
print (np.shape(earth))
>>> (10, 50, 49, 2)
with the last axis having dimension 2: one Vp and one Rho, so in summary 10 models of size 50×49, each with a Vp and a Rho.
So with this other block of code:
impedances = [np.apply_along_axis(np.product, 
              -1, e).astype(float) for e in earth]

you use numpy.apply_along_axis to multiply Vp by Rho along the last dimension (-1) using numpy.product, and the list comprehension [... for e in earth] to do it for all models in the array earth.
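
For what it’s worth, an equivalent way to get the same numbers (as a single 3D array rather than a list of ten 2D arrays) is to multiply the two components of the last axis directly; just a sketch, assuming the earth array from the notebook:

import numpy as np

impedances_alt = (earth[..., 0] * earth[..., 1]).astype(float)   # Vp * Rho, shape (10, 50, 49)
# or, identically:
impedances_alt = np.prod(earth, axis=-1).astype(float)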


What is acquisition footprint noise in seismic data?

Acquisition footprint is a noise field that appears on 3D seismic amplitude slices or horizons as an interwoven linear crosshatching parallel to the source line and receiver line directions. It is for the most part an expression of inadequate acquisition geometry, resulting in insufficient sampling of the seismic wave field (aliasing) and irregularities in the offset and azimuth distribution, particularly in the cross line direction.

Sometimes source-generated noise and incorrect processing (for example residual NMO due to erroneous velocity picks, incomplete migration, or other systematic errors) can accentuate the footprint.

This noise can interfere with the mapping of stratigraphic features and fault patterns, posing a challenge to seismic interpreters working in both exploration and development settings.

To demonstrate the relevance of the phenomenon I show below a gallery of examples from the literature of severe footprint in land data: an amplitude time slice (Figure 1a) and a vertical section (Figure 1b) from a Saudi Arabian case study, some seismic attributes (Figures 2, 3, 4, and 5), and also some modeled streamer data (Figure 6).

Figure 1. Amplitude time slice (top, time = 0.44 s) showing footprint in both inline and crossline direction, and amplitude section (bottom) highlighting the effect in the vertical direction. From Al-Bannagi et al. Copyrighted material.

Figure 2. Edge detection (Sobel filter) on the Penobscot 3D horizon (average time ~= 0.98 s) displaying N-S footprint. From Hall.

Figure 3. Edge detection (Sobel filter) on a shallow horizon (average time ~= 0.44 s)  from the F3 Netherlands 3D survey displaying E-W footprint.

Figure 4. Similarity attribute (top, time = 0.6 s), and most positive curvature (bottom, time = 1.3 s), both showing footprint. From Davogustto and Marfurt. Copyrighted material.

Figure 5. Amplitude time slice (top, time = 1.32 s) and the corresponding coherence section (bottom), both showing footprint. From Chopra and Larsen. Copyrighted material.

Figure 6. Acquisition footprint in the form of low fold striation due to dip streamer acquisition. From Long et al. Copyrighted material.

In my next post I will review (with more examples from the literature) some strategies available to either prevent or minimize the footprint with better acquisition parameters and modeling of the stack response. I will also discuss some ways the footprint can be attenuated after the acquisition of the data (with bin regularization/interpolation, dip-steered median filters, and kx-ky filters, from simple low-pass to more sophisticated ones) when the above-mentioned strategies are not available, due to time/cost constraints or because the interpreter is working with legacy data.

In subsequent posts I will illustrate a workflow to model synthetic acquisition footprint using Python, and how to automatically remove it in the Fourier domain with frequency filters, and then how to remove it from real data.

References

Al-Bannagi et al. 2005 – Acquisition footprint suppression via the truncated SVD technique: Case studies from Saudi Arabia: The Leading Edge, SEG, 24, 832– 834.

Chopra and Larsen,  2000 – Acquisition Footprint, Its Detection and Removal: CSEG Recorder, 25 (8).

Davogustto and Marfurt, 2011 – Footprint Suppression Applied to Legacy Seismic Data Volumes: 31st Annual GCSSEPM Foundation Bob F. Perkins Research Conference 2011.

F3 Netherlands open access 3D:  info on SEG Wiki

Hall, 2014 –  Sobel filtering horizons (open source Jupyter Notebook on GitHub).

Long et al., 2004 – On the issue of strike or dip streamer shooting for 3D multi-streamer acquisition: Exploration Geophysics, 35(2), 105-110.

Penobscot open access 3D:  info on SEG Wiki