2016 Machine learning contest – Society of Exploration Geophysicists
In a previous post I showed how to use pandas.isnull to find out, for each well individually, whether a column has any null values, and sum to count how many there are in each column. Here is one of the examples (with more modern, pandas-flavoured syntax than the example in the previous post):
for well, data in training_data.groupby('Well Name'):
    print(data.isnull().sum(), '\n')
Simple and quick, the output showed me that – for example – the well ALEXANDER D is missing 466 samples from the PE log:
Well Name      0
...
PE           466
...
A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function from the missingno library. With the code below I can take each well's data as a Pandas DataFrame on the fly, then draw a missingno matrix plot for it.
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno

for well, data in training_data.groupby('Well Name'):
    msno.matrix(data, color=(0., 0., 0.45))
    fig = plt.gcf()
    fig.set_size_inches(20, np.round(len(data)/100))  # height of the plot for each well reflects well length
    ax = plt.gca()
    ax.set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')
In each of the following plots, the sparkline at the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset. To me this is a much more compelling and informative way to inspect log data, as it shows the depth ranges where data is missing. The well length is also annotated at the bottom left, from which I learned that Recruit F9 is much shorter than the other wells. To reinforce this, I added a line of code so that each plot's height reflects the well's length. I really like it!
2020 Machine Predicted Lithology – FORCE
Since I am taking part in this year's FORCE Machine Predicted Lithology challenge, I decided to take the above visualization up a notch. Using ipywidgets' interactive and similar logic (in this case data['WELL'].unique()), the tool below allows browsing wells with a Select widget and checking the chosen well's curve completeness on the fly. You can try the tool in this Jupyter notebook.
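A minimal sketch of the idea is below; it assumes the dataset is already loaded in a DataFrame called data with a WELL column, and that a notebook frontend is available to render the widget:

import ipywidgets as widgets
import matplotlib.pyplot as plt
import missingno as msno

def show_completeness(well):
    # plot the nullity matrix for the selected well only
    msno.matrix(data[data['WELL'] == well], color=(0., 0., 0.45))
    plt.show()

widgets.interactive(show_completeness,
                    well=widgets.Select(options=sorted(data['WELL'].unique())))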
In a second Jupyter notebook, on the other hand, I used the missingno matrix to make a quick visual summary of the completeness of the entire dataset, log by log (all wells together):
Then, to explore the data completeness in more depth, below I also plotted the library's dendrogram plot. As explained in the library's documentation, the dendrogram uses a hierarchical clustering algorithm (courtesy of SciPy) to bin variables against one another by their nullity correlation (measured in terms of binary distance): "At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero."
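For reference, the plot itself is a one-liner (again assuming the full dataset in a DataFrame called data):

import missingno as msno

msno.dendrogram(data)  # cluster logs by how similarly their values are missing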
I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they couldn’t be used to guide the strategy to deal with missing data, together with domain knowledge from petrophysics.
Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thought is that this may suggest predicting missing values sequentially rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from the top right, I am thinking of testing the use of GR to first predict missing values in RDEP, then both to predict missing values in RMED, then DTC; then adding CALI and using all logs completed so far to predict RHOB, and so on.
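As a very rough sketch of what I have in mind (not tested on the contest data; the regressor choice and the completion order are placeholders, and rows with missing predictors are simply skipped):

from sklearn.ensemble import RandomForestRegressor

def fill_sequentially(df, order=('RDEP', 'RMED', 'DTC', 'RHOB')):
    """Predict each log's missing values from the logs completed so far."""
    df = df.copy()
    predictors = ['GR']  # start from a log with (nearly) no missing values
    for target in order:
        known = df[target].notna() & df[predictors].notna().all(axis=1)
        to_fill = df[target].isna() & df[predictors].notna().all(axis=1)
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(df.loc[known, predictors], df.loc[known, target])
        df.loc[to_fill, target] = model.predict(df.loc[to_fill, predictors])
        predictors.append(target)  # the completed log joins the predictors
    return df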
Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and cross-validation error first using only non-NaN rows, then replacing all NaNs with the mean, and then comparing this sequential log-completing strategy with an all-in-one strategy.
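A skeleton of such a comparison, with hypothetical feature matrices X_drop, X_mean, and X_seq (one per strategy) and lithology labels y, might be:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

strategies = {'drop NaN rows': X_drop, 'mean fill': X_mean, 'sequential fill': X_seq}
for name, X in strategies.items():
    # cross-validated lithology prediction accuracy for each completion strategy
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(f'{name}: mean accuracy {scores.mean():.3f}')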
Last year, in a post titled Unweaving the rainbow, Matt Hall described our joint attempt to make a Python tool for recovering digital data from scientific images (and seismic sections in particular), without any prior knowledge of the colormap. Please check our GitHub repository for the code and slides, and watch Matt's talk (very insightful and very entertaining) from the 2017 Calgary Geoconvention below:
One way to use the app is to get an image with unknown, possibly awful colormap, get the data, and re-plot it with a good one.
So it might come as a surprise to some, but this post is a lifesaver for those who really do like rainbow-like colormaps. I discuss a Python method to equalize colormaps so as to render them perceptual. The method is based in part on ideas from Peter Kovesi's must-read paper – Good Colour Maps: How to Design Them – and the Matlab function equalisecolormap, and in part on ideas from some old experiments of mine, described here, and a Matlab prototype code (more details in the notebook for this post).
Let's get started. Below is a time structure map for a horizon in the Penobscot 3D survey (offshore Nova Scotia, licensed CC-BY-SA by dGB Earth Sciences and The Government of Nova Scotia). Can you clearly identify the discontinuities in the southern portion of the map? No?
OK, let me help you. Below I am showing the map resulting from running a Sobel filter on the horizon.
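If you want to try this at home, an edge map of a gridded horizon is a few lines with SciPy's Sobel operator (here z is an assumed 2D array of horizon times):

import numpy as np
from scipy import ndimage

# gradient magnitude from horizontal and vertical Sobel derivatives
edges = np.hypot(ndimage.sobel(z, axis=0), ndimage.sobel(z, axis=1))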
This is much better, right? But the truth is that the discontinuities are right there in the original data; some, however, are very hard to see because of the colormap used (nipy_spectral, one of the many Matplotlib colormaps), which introduces perceptual artifacts, most notably in the green-to-cyan portion.
In the figure below, in the first panel (from the top) I show a plot of the colormap's Lightness value (obtained by converting a 256-sample nipy_spectral colormap from RGB to Lab) for each sample; the line is coloured by the original RGB colour. This erratic Lightness profile highlights the issue with this colormap: the curve's gradient changes magnitude several times, indicating a nonuniform perceptual distance between samples.
In the second panel, I show a plot of the cumulative sample-to-sample Lightness contrast differences, again coloured by the original RGB colours in the colormap. This is the best plot to look at because flat spots in the cumulative curve correspond to perceptual flat spots in the map, which is where the discontinuities become hard to see. Notice how the green-to-cyan portion of this curve is virtually horizontal!
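These two panels can be reproduced along these lines, assuming scikit-image is available for the RGB-to-Lab conversion:

import numpy as np
import matplotlib.pyplot as plt
from skimage import color

rgb = plt.get_cmap('nipy_spectral')(np.linspace(0, 1, 256))[:, :3]
L = color.rgb2lab(rgb.reshape(1, -1, 3)).reshape(-1, 3)[:, 0]  # Lightness channel
cumulative = np.insert(np.cumsum(np.abs(np.diff(L))), 0, 0)  # second panel's curve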
That’s it, it is simply a matter of very low, artificially induced perceptual contrast.
There are two solutions to this problem: the obvious one is to NOT use this type of colormap (you can learn much about which colormaps are perceptually good, and which are not, here); a possible alternative is to fix them. This can be done by re-sampling the cumulative curve so as to give it constant slope (that is, constant perceptual contrast). The irregularly spaced dots at the bottom of the second panel show the re-sampling locations, which are much farther apart in the perceptually flat areas and much closer in the steeper areas.
The third panel shows the resulting constant (and regularly sampled) cumulative Lightness contrast differences, and the fourth and last panel the final Lightness profile, which is now composed of segments with equal Lightness gradient (in absolute value).
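Continuing the sketch above, the re-sampling itself boils down to an inverse interpolation of the cumulative curve (a simplified take on the method; the notebook for this post has the full version):

from matplotlib.colors import ListedColormap

# pick sample positions so the cumulative contrast grows at a constant rate
even_contrast = np.linspace(0, cumulative[-1], 256)
positions = np.interp(even_contrast, cumulative, np.linspace(0, 1, 256))
equalized = ListedColormap(plt.get_cmap('nipy_spectral')(positions))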
Here is the structure map for the Penobscot horizon using the nipy_spectral colormap before and after equalization, one on top of the other to facilitate comparison. I think this method works rather well, and it will allow hard-core aficionados to keep using their favourite rainbow and rainbow-like colormaps.
A couple of years ago I stumbled upon a great 2001 paper by Beyer in The Leading Edge. Being interested in visualization techniques, I was drawn to the display in Figure 1 (which is a low-resolution copy of Figure 2 in the paper). But what really amazed me was the suggestion that a display like this could be created in a few minutes, without doing any interpretation, just by manipulating instantaneous phase slices. With the only condition being data of fair quality, this promised to be an awesome reconnaissance tool.
Figure 1 – Copyright SEG
The idea that instantaneous phase is a great attribute for interpretation has been around for a long time. There is, for example, a 1989 Exploration Geophysics paper by Duff and Mason. These two authors argue that amplitude time slices are a suboptimal choice and that instantaneous phase slices should be preferred. They give three reasons:
1) on amplitude time slices only relatively strong events remain above the bias level after gain and scaling; weak events are submerged below the bias and remain unmappable. Although it is the topic of a future post, it is worth mentioning that I think this effect is exacerbated by the common but unfortunate choice of a divergent color palette with white in the middle. White is so bright (I call it the white hole) that even more low-amplitude events become indiscernible.
2) discrete boundaries corresponding to unique positions on the wavelet are displayed on instantaneous phase slices – this intra-wavelet detail is lost on amplitude slices.
Figure 2 – after Duff and Mason, Figure 2
3) instantaneous phase time slices DIRECTLY give the sense of time dip for dipping events. In Figure 2 I show 2 parallel dipping reflectors, represented by 5 (non-consecutive) traces, and 2 (non-consecutive) instantaneous phase slices (at arbitrary times t1 and t2). I marked 5 discrete phase events for the top dipping reflector. The sense of time dip is given (with an appropriate color palette) by the sense of color transition. Conversely, this intra-wavelet detail would be lost on the amplitude time slices, with amplitudes between the black center trace and the red traces, and amplitudes between the red traces and the green traces, lost within a single broad zone. The difference is probably not as dramatic nowadays with the increase in available dynamic range, but using instantaneous phase slices still remains advantageous for detailed mapping.
Beyer's seismic terrain is just a natural extension of the instantaneous phase time slice (quoting from the paper): "… it then follows that the instantaneous phase (-180 deg to +180 deg) can simply be rescaled to the wavelength in ms of pseudoseismic two-way time… Seismic terrain can be thought of as a type of instantaneous wavelength generated from instantaneous phase along a time slice". With reference to the top dipping reflector in Figure 3, the method allows converting the brown phase segment to the dipping blue segment (and similarly the yellow phase segment to the dipping green segment for the bottom dipping event).
The practice – Petrel
Let’s see how we can create a terrain display similar to that in Figure 1 using Petrel.
The process starts with migrated seismic data, from which we need to generate both the phase and the frequency components to get the instantaneous wavelength.
For this tutorial I use a public seismic dataset (BPA9901) available on the Norwegian Public Data Portal. In Figure 4 below I am showing an amplitude time slice (above the Chalk) from the migrated seismic volume.
Step 1 – generate phase component
The first step is to generate an instantaneous phase attribute volume. This is found in the volume attributes. In Figure 5 below I am showing the instantaneous phase slice corresponding to the amplitude time slice of Figure 4.
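Outside Petrel, for readers who want to experiment, the instantaneous phase of a single trace can be computed from the analytic signal; a sketch with SciPy (trace being an assumed 1D array of amplitudes):

import numpy as np
from scipy.signal import hilbert

# angle of the analytic signal, converted to degrees in [-180, 180]
phase = np.degrees(np.angle(hilbert(trace)))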
*** N.B. *** If significant regional dips are observed in the seismic data, in some cases it may be beneficial (please see the comment section) to remove them through flattening prior to the terrain generation.
Step 2 – generate frequency component
This is the trickiest part. In theory, to get the instantaneous wavelength we would have to calculate the instantaneous frequency and divide the instantaneous phase by it. However, the instantaneous frequency attribute can be very noisy and can have spurious values in areas of low amplitude in the input data. A good practical alternative is to measure a single value of the dominant wavelet period T in an area of relatively flat reflections near the zone of interest, as I am showing in Figure 6.
For the more avid readers, this is all explained quite nicely in Beyer (quoting from the paper): "Complex trace relationships dictate that the wavelength is the phase component divided by the frequency component. Thus one may be compelled to derive the seismic terrain by dividing the extracted instantaneous phase by the extracted instantaneous frequency (carefully applying unit conversion of 1000 ms/360 deg or 2.78). However extracted instantaneous frequency tends to include spurious values in low amplitudes (approaching infinity according to literature and practice), which correspond to poor data quality zones. Instantaneous frequency or even averaged instantaneous frequency renders the seismic terrain noisy, unrealistic, and misleading. Years of extensive use have shown that a single value of visually estimated dominant wavelet period (i.e. cycles per second) produces a very high-quality seismic terrain that closely fits the seismic events over wide areas".
Step 3 – generate the instantaneous wavelength (seismic terrain)
Having estimated the dominant wavelet period T (in ms), I can now use it to generate the instantaneous wavelength. We are essentially converting the data from the range [-180, 180] deg to the range [-T/2, T/2] ms.
In Petrel I do it in the calculator with a formula of the type:
terrain=(((instantaneous phase +180)*T/2)/180)
The actual formula used is shown on the top row of Figure 7 below. You will notice that it isn't exactly the same as the above formula: I added 1 to the 180 in the numerator to avoid zero values in the output.
Once the division is performed, I subtract T/2 to shift the result back to the [-T/2, T/2] range.
Notice from Figure 7 that because we added 181 but divided by 180 there is a small adjustment to be made by hand. I get this small adjustment by double-clicking on the output volume to get its statistics. In this case it is ±0.16 ms, so I run the formula a second time (bottom row, Figure 7), this time subtracting T/2 + 0.16 instead of just T/2.
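For clarity, here is the same rescaling written out in Python rather than in the Petrel calculator (T's value is just an example, not the one used here; phase is the instantaneous phase from the earlier sketch):

T = 28.0  # dominant wavelet period in ms (visually estimated; example value only)

# rescale phase from [-180, 180] deg to [-T/2, T/2] ms of pseudo two-way time
terrain = (phase + 180.0) * (T / 2.0) / 180.0 - T / 2.0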
Step 4 – display seismic terrain
There are two options in Petrel to display the resulting seismic terrain volume:
Option 1 – display bump mapped terrain slices in 2D or 3D window
This is my preferred option for scanning up and down through the terrain slices. The bump-mapping effect is obtained by double-clicking on the terrain volume with a 2D or 3D window open and selected, in the Style tab > Intersection tab. In Figure 8 I am showing the bump-mapped terrain slice corresponding to the instantaneous phase time slice of Figure 5.
Option 2 – display selected horizons of interest in 3D window
One may want to create a display such as the one in Figure 1. For me this is intended for a later stage, when integrating, perhaps, with extracted amplitude or attribute anomalies highlighting hydrocarbon presence.
As is often the case in Petrel, there are different ways of achieving the same result. This is how I do it. First, I create a flat surface with a TWT value corresponding to the slice I am interested in (as in Figure 9, left panel). Then I extract and append to this surface the terrain values from that slice of interest (as in Figure 9, right panel).
Finally, in the calculations tab, I add a constant time shift corresponding to the time associated with the slice of interest: notice the difference in Z value between the left panel in Figure 10 (before the calculation) and the right panel (after the calculation). It is also necessary to use the extracted value as the visual vertical position, as illustrated in Figure 11.
I am showing the result in Figure 12. This is the same terrain slice as in Figure 8.
As a quick QC I am displaying in Figure 13 a vertical section (corresponding to the thick black line in Figure 12) from the input seismic data, with the extracted surface drawn as a thin black line.
The terrain deteriorates to the far left as we approach the edge of the survey, with fold decreasing and noise increasing, and there is a cycle skip towards the far right. But all in all I think this is a very good result: it captures the faults well, and the whole process took less than 20 minutes with no picking.
This method has one limitation: the maximum fault throws or stratigraphic relief (in milliseconds) that can be mapped is equal to the period T.
I wish to thank DONG Energy for agreeing to the publication of the seismic images, which were generated using company licensed Petrel.
** UPDATE **
A few readers asked for clarification on the benefits and potential uses of this attribute. The short answer is that this is in fact a pseudo-horizon that tracks dipping seismic events accurately within the range of the dominant wavelet period (as seen in Figure 13), which makes it an excellent reconnaissance tool. A good-quality first-pass map can be made in minutes in areas where detailed mapping can take days. An excellent example is Figure 16 in the original paper. More details can be found in the last two paragraphs of the paper: Reservoir-scale structural data from seismic terrain and Fast 3-D screening.