This weekend – it was long overdue – I cleaned up a notebook prepared last year to accompany the talk “Data science tools for petroleum exploration and production“, which I gave with my friend Thomas Speidel at the Calgary 2018 CSEG/CSPG Geoconvention.

The Python notebook can be downloaded from GitHub as part of a full repository, which includes R code from Thomas, or run interactively with Binder.

In this post I want to highlight on one aspect in particular: doing data exploration visually, but also quantitatively with inferential statistic tests.

Like all data scientists (professional, or in the making, and I ‘park’ myself for now in the latter bin), a scatter matrix is often the first thing I produce, after data cleanup, to look for obvious pairwise relationships and trends between variables. And I really love the flexibility, and looks of `seaborn`

‘s `pairgrid`

.

In the example below I use again data from Lee Hunt, Many correlation coefficients, null hypoteses, and high value CSEG Recorder, December 2013; my followers would be familiar with this small toy dataset from reading Geoscience ML notebook 2 and Geoscience_ML_notebook 3.

The scatter matrix in Figure 1 includes bivariate scatter-plots in the upper triangle, contours in the lower triangle, shape of the bivariate distributions on the diagonal.

matplotlib.pyplot.rcParams["axes.labelsize"] = 18 g = seaborn.PairGrid(data, diag_sharey=False) axes = g.axes g.map_upper(matplotlib.pyplot.scatter, linewidths=1, edgecolor="w", s=90, alpha = 0.5) g.map_diag(seaborn.kdeplot, lw = 4, legend=False) g.map_lower(seaborn.kdeplot, cmap="Blues_d") matplotlib.pyplot.show()

It looks terrific. But, as I said, I’d like to show how to add to the individual scatterplots some useful annotation. These are the ones I typically focus on:

- Spearman rank correlation coefficient – this is a bit more robust than Pearson’s correlation coefficient, but I will come back and add an option for distance correlation
- Confidence interval for the correlation coefficient
- Probability of spurious correlation

For the Spearman correlation coefficient I use `scipy.stats.spearmanr`

, whereas for the confidence interval and the probability of spurious correlation I use my own functions, which I include below (following, respectively, Stan Brown’s Stats without tears and Cynthia Kalkomey’s Potential risks when using seismic attributes as predictors of reservoir properties):

def confInt(r, nwells): z_crit = scipy.stats.norm.ppf(.975) std_Z = 1/numpy.sqrt(nwells-3) E = z_crit*std_Z Z_star = 0.5*(numpy.log((1+r)/(1.0000000000001-r))) ZCI_l = Z_star - E ZCI_u = Z_star + E RCI_l = (numpy.exp(2*ZCI_l)-1)/(numpy.exp(2*ZCI_l)+1) RCI_u = (numpy.exp(2*ZCI_u)-1)/(numpy.exp(2*ZCI_u)+1) return RCI_u, RCI_l

def P_spurious (r, nwells, nattributes): t_of_r = r * numpy.sqrt((nwells-2)/(1-numpy.power(r,2))) p = scipy.stats.t.sf(numpy.abs(t_of_r), nwells-2)*2 ks = numpy.arange(1, nattributes+1, 1) return numpy.sum(p * numpy.power(1-p, ks-1))

I also need a function to calculate the critical r, that is, the value of correlation coefficient above which one can rule out chance as an explanation for a relationship:

def r_crit(nwells, a): t = scipy.stats.t.isf(a, nwells-2) r_crit = t/numpy.sqrt((nwells-2)+ numpy.power(t,2)) return r_crit

where `a`

is equal to `alpha/2`

, alpha being the level of significance, or the chance of being wrong that one accepts to live with, and `nwells-2`

is equivalent to the degrees of freedom.

Finally, I need a utility function (adapted from this Stack Overflow answer) to calculate on the fly and annotate the individual scatterplots with the CC, CI, and P.

def corrfunc(x, y, rc=rc, **kws): r, p = svipy.stats.spearmanr(x, y) u, l = confInt(r, 21) if r > rc: rclr = 'g' else: rclr= 'm' if p > 0.05: pclr = 'm' else: pclr= 'g' ax = matplotlib.pyplot.gca() ax.annotate("CC = {:.2f}".format(r), xy=(.1, 1.25), xycoords=ax.transAxes, color = rclr, fontsize = 14) ax.annotate("CI = [{:.2f} {:.2f}]".format(u, l), xy=(.1, 1.1), xycoords=ax.transAxes, color = rclr, fontsize = 14) ax.annotate("PS = {:.3f}".format(p), xy=(.1, .95), xycoords=ax.transAxes, color = pclr, fontsize = 14)

So, now I just calculate the critical r for the number of observations, 21 wells in this case:

rc = r_crit(21, 0.025)

and wrap it all up this way:

```
matplotlib.pyplot.rcParams["axes.labelsize"] = 18
g = seaborn.PairGrid(data, diag_sharey=False)
axes = g.axes
g.map_upper(matplotlib.pyplot.scatter, linewidths=1,
edgecolor="w", s=90, alpha = 0.5)
g.map_upper(corrfunc)
g.map_diag(seaborn.kdeplot, lw = 4, legend=False)
g.map_lower(seaborn.kdeplot, cmap="Blues_d")
matplotlib.pyplot.show()
```

so that:

- the correlation coefficient is coloured green if it is larger than the critical r, else coloured in purple
- the confidence interval is coloured green if both lower and upper are larger than the critical r, else coloured in purple
- the probability of spurious correlation is coloured in green when below 0.05 (or 5% chance)

Nice, Matteo. I love those 2D KDEs. I wonder how it would look to attach a colourmap to the facecolour of the plot, and show the correlation coefficient that way? I’ve also used a nice thick outline for each plot, and that looks OK.

Thanks Matt. No, I have not tried to use a colormap directly; however I did experiment with replacing the scatterplot in the upper diagonal with a hexbin plot. IT looked OK but needed lots more customization as the y were all shaped slightly different and would probably need handling of outliers for each pair individually. Perhaps I will go back to it and see if I can add colormaps, although in the end I’d say if you have a good number of variables it may become impractical to read/interpret.

Pingback: Data exploration in Python: distance correlation and variable clustering | MyCarta

Pingback: Variable selection in Python, part I | MyCarta

Pingback: Data exploration in Python: distance correlation and variable clustering | MyCarta