# Visual data exploration in Python – correlation, confidence, spuriousness

This weekend – it was long overdue – I cleaned up a notebook prepared last year to accompany the talk “Data science tools for petroleum exploration and production”, which I gave with my friend Thomas Speidel at the Calgary 2018 CSEG/CSPG Geoconvention.

The Python notebook can be downloaded from GitHub as part of a full repository, which includes R code from Thomas, or run interactively with Binder.

In this post I want to highlight one aspect in particular: doing data exploration visually, but also quantitatively with inferential statistical tests.

Like all data scientists (professional, or in the making, and I ‘park’ myself for now in the latter bin), I often produce a scatter matrix first, right after data cleanup, to look for obvious pairwise relationships and trends between variables. And I really love the flexibility and looks of `seaborn`’s `PairGrid`.

In the example below I again use data from Lee Hunt, “Many correlation coefficients, null hypotheses, and high value”, CSEG Recorder, December 2013; my followers will be familiar with this small toy dataset from reading Geoscience ML notebook 2 and Geoscience_ML_notebook 3.

The scatter matrix in Figure 1 includes bivariate scatter plots in the upper triangle, bivariate density contours in the lower triangle, and the shape of each variable’s distribution on the diagonal.

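```python
import matplotlib.pyplot
import seaborn

# `data` is the DataFrame holding the dataset, loaded earlier
matplotlib.pyplot.rcParams["axes.labelsize"] = 18
g = seaborn.PairGrid(data, diag_sharey=False)
axes = g.axes
# upper triangle: scatter plots
g.map_upper(matplotlib.pyplot.scatter, linewidths=1,
            edgecolor="w", s=90, alpha=0.5)
# diagonal: kernel density estimate of each variable
g.map_diag(seaborn.kdeplot, lw=4, legend=False)
# lower triangle: bivariate density contours
g.map_lower(seaborn.kdeplot, cmap="Blues_d")
matplotlib.pyplot.show()
```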

Figure 1. Scatter matrix from `PairGrid`.

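It looks terrific. But, as I said, I’d like to show how to add some useful annotations to the individual scatterplots. These are the ones I typically focus on:

• the Spearman correlation coefficient (CC)
• the confidence interval of the correlation coefficient (CI)
• the probability of spurious correlation (P)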

For the Spearman correlation coefficient I use `scipy.stats.spearmanr`, whereas for the confidence interval and the probability of spurious correlation I use my own functions, which I include below (following, respectively, Stan Brown’s Stats without tears and Cynthia Kalkomey’s Potential risks when using seismic attributes as predictors of reservoir properties):

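```python
import numpy
import scipy.stats

def confInt(r, nwells):
    """95% confidence interval for r, via Fisher's z-transform."""
    z_crit = scipy.stats.norm.ppf(.975)   # two-sided 95% critical value
    std_Z = 1/numpy.sqrt(nwells-3)        # standard error of z
    E = z_crit*std_Z                      # margin of error
    # Fisher z-transform; the tiny offset avoids division by zero at r = 1
    Z_star = 0.5*(numpy.log((1+r)/(1.0000000000001-r)))
    ZCI_l = Z_star - E
    ZCI_u = Z_star + E
    # back-transform the interval limits from z to r
    RCI_l = (numpy.exp(2*ZCI_l)-1)/(numpy.exp(2*ZCI_l)+1)
    RCI_u = (numpy.exp(2*ZCI_u)-1)/(numpy.exp(2*ZCI_u)+1)
    return RCI_u, RCI_l   # note the order: upper limit first
```

```python
def P_spurious(r, nwells, nattributes):
    """Probability that at least one of nattributes independent
    attributes correlates this strongly by chance."""
    t_of_r = r * numpy.sqrt((nwells-2)/(1-numpy.power(r,2)))
    p = scipy.stats.t.sf(numpy.abs(t_of_r), nwells-2)*2   # two-sided p-value
    ks = numpy.arange(1, nattributes+1, 1)
    return numpy.sum(p * numpy.power(1-p, ks-1))
```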
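As a cross-check on the two functions above, here is a minimal sketch that restates them using two standard identities. Fisher’s z-transform is the inverse hyperbolic tangent, so the interval can be written with `numpy.arctanh` and `numpy.tanh`. The finite sum returned by `P_spurious` is a geometric series, with closed form 1 - (1 - p)^n for n attributes. The names `fisher_ci` and `p_spurious_closed` are made up for this illustration, and unlike `confInt` the interval is returned lower limit first.

```python
import numpy
import scipy.stats

def fisher_ci(r, nwells, level=0.95):
    """Confidence interval for r as (lower, upper), via arctanh/tanh."""
    z_crit = scipy.stats.norm.ppf(0.5 + level / 2)
    half_width = z_crit / numpy.sqrt(nwells - 3)
    z = numpy.arctanh(r)  # same as 0.5*log((1+r)/(1-r)); requires |r| < 1
    return numpy.tanh(z - half_width), numpy.tanh(z + half_width)

def p_spurious_closed(r, nwells, nattributes):
    """Closed form of sum_{k=1..n} p*(1-p)**(k-1), that is 1-(1-p)**n."""
    t_of_r = r * numpy.sqrt((nwells - 2) / (1 - r**2))
    p = 2 * scipy.stats.t.sf(numpy.abs(t_of_r), nwells - 2)
    return 1 - (1 - p)**nattributes

# Example values chosen for illustration: r = 0.7 with 21 wells
lower, upper = fisher_ci(0.7, 21)
print("CI for r = 0.7, 21 wells: [{:.2f} {:.2f}]".format(lower, upper))
for k in (1, 5, 10):
    print(k, "attributes ->", p_spurious_closed(0.7, 21, k))
```

The closed form makes the behaviour easy to see: for a fixed correlation and number of wells, the probability of a spurious correlation can only grow as more attributes are screened.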

I also need a function to calculate the critical r, that is, the value of the correlation coefficient above which one can rule out chance as an explanation for a relationship at the chosen level of significance:

```python
def r_crit(nwells, a):
    # critical t for an upper-tail probability a
    t = scipy.stats.t.isf(a, nwells-2)
    return t/numpy.sqrt((nwells-2) + numpy.power(t, 2))
```

where `a` is equal to `alpha/2`, alpha being the level of significance (the chance of being wrong that one is willing to live with), and `nwells-2` is the number of degrees of freedom.
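To make the role of `a` concrete, here is a small self-contained sketch for the 21 wells used in this post, with alpha = 0.05 so that `a` = 0.025. It repeats the `r_crit` formula so that it can run on its own, then converts the critical r back to a t statistic to confirm that the two-sided p-value at the threshold equals alpha. The variable names outside `r_crit` are illustrative only.

```python
import numpy
import scipy.stats

def r_crit(nwells, a):
    """Critical correlation coefficient for upper-tail probability a."""
    t = scipy.stats.t.isf(a, nwells - 2)
    return t / numpy.sqrt((nwells - 2) + t**2)

nwells, alpha = 21, 0.05
rc = r_crit(nwells, alpha / 2)

# Round trip: a correlation exactly equal to rc should have a
# two-sided p-value equal to alpha.
t_back = rc * numpy.sqrt((nwells - 2) / (1 - rc**2))
p_back = 2 * scipy.stats.t.sf(t_back, nwells - 2)

print("critical r for {} wells: {:.3f}".format(nwells, rc))
print("two-sided p-value at the threshold: {:.3f}".format(p_back))
```

With 21 wells the threshold comes out at roughly 0.43, so weaker correlations cannot be distinguished from chance at the 5% level.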

Finally, I need a utility function (adapted from this Stack Overflow answer) to calculate on the fly and annotate the individual scatterplots with the CC, CI, and P.

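```python
# Note: the default rc=rc is evaluated when the function is defined,
# so rc (the critical r, from r_crit) must already exist at this point.
def corrfunc(x, y, rc=rc, **kws):
    r, p = scipy.stats.spearmanr(x, y)
    u, l = confInt(r, 21)   # upper and lower CI limits for 21 wells
    # CC: green if above the critical r, else magenta
    rclr = 'g' if r > rc else 'm'
    # CI: green if the whole interval lies above the critical r
    ciclr = 'g' if l > rc else 'm'
    # p is the p-value returned by spearmanr: green if at or below 0.05
    pclr = 'm' if p > 0.05 else 'g'
    ax = matplotlib.pyplot.gca()
    ax.annotate("CC = {:.2f}".format(r), xy=(.1, 1.25),
                xycoords=ax.transAxes, color=rclr, fontsize=14)
    ax.annotate("CI = [{:.2f} {:.2f}]".format(l, u), xy=(.1, 1.1),
                xycoords=ax.transAxes, color=ciclr, fontsize=14)
    ax.annotate("PS = {:.3f}".format(p), xy=(.1, .95),
                xycoords=ax.transAxes, color=pclr, fontsize=14)
```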

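So, now I just calculate the critical r for the number of observations, 21 wells in this case (since `rc` is used as a default argument of `corrfunc`, this has to run before `corrfunc` is defined):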

`rc = r_crit(21, 0.025)`

and wrap it all up this way:

```python
matplotlib.pyplot.rcParams["axes.labelsize"] = 18
g = seaborn.PairGrid(data, diag_sharey=False)
axes = g.axes
g.map_upper(matplotlib.pyplot.scatter, linewidths=1,
            edgecolor="w", s=90, alpha=0.5)
g.map_upper(corrfunc)
g.map_diag(seaborn.kdeplot, lw=4, legend=False)
g.map_lower(seaborn.kdeplot, cmap="Blues_d")
matplotlib.pyplot.show()
```

so that:

• the correlation coefficient is coloured green if it is larger than the critical r, else magenta
• the confidence interval is coloured green if both lower and upper limits are larger than the critical r, else magenta
• the probability of spurious correlation is coloured green when it is 0.05 (a 5% chance) or lower, else magenta

Figure 2. Enhanced scatter matrix.

## 5 thoughts on “Visual data exploration in Python – correlation, confidence, spuriousness”

1. Nice, Matteo. I love those 2D KDEs. I wonder how it would look to attach a colourmap to the facecolour of the plot, and show the correlation coefficient that way? I’ve also used a nice thick outline for each plot, and that looks OK.

2. Thanks Matt. No, I have not tried to use a colormap directly; however I did experiment with replacing the scatterplot in the upper triangle with a hexbin plot. It looked OK but needed lots more customization, as they were all shaped slightly differently and would probably need handling of outliers for each pair individually. Perhaps I will go back to it and see if I can add colormaps, although in the end I’d say if you have a good number of variables it may become impractical to read/interpret.