In the first post of this series I showed how to use Pandas, Seaborn, and Matplotlib to:
- load a dataset
- test, clean up, and summarize the data
- start looking for relationships between variables using scatterplots and correlation coefficients
In this second post, I will expand on the last point by introducing some tests and visualizations that help highlight possible criteria for keeping some variables and dropping others. All in Python.
I will use a different dataset from the one in the previous post. This one is from the paper “Many correlation coefficients, null hypotheses, and high value” (Lee Hunt, CSEG Recorder, December 2013).
The target to be predicted is oil production from a marine barrier sand. We have measured production (in tens of barrels per day) and 7 (initially unknown) predictors, at 21 wells.
Hang on tight, and read along, because it will be a wild ride!
I will show how to:
1) automatically flag linearly correlated predictors, so we can decide which might be dropped. In the example below (a matrix of pair-wise correlation coefficients between variables) we see that X2 and X7, the second and third best individual predictors of production (shown in the bottom row), are also highly correlated with X1, the best overall predictor.
2) automatically flag predictors that fail a critical r test
3) create a table to assess the probability that a certain correlation is spurious, in other words the probability of getting a correlation coefficient at least as large as the one we got with our sample, purely by chance.
I do not recommend running these tests and applying the criteria blindly. Rather, I will suggest how to use them to learn more about the data and, in conjunction with domain knowledge about the problem at hand (in this case oil production), make more informed choices about which variables should be used and which should not.
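As a rough sketch of what these three steps could look like (this is not the code from the notebook), the snippet below uses a synthetic stand-in dataset with the same shape as the one described above: 21 wells and 7 predictors. The random values, the column name `Y`, the 0.7 flagging threshold, the 5% significance level, the two-tailed test, and the helper names `critical_r` and `p_spurious` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data, NOT the values from the paper: 21 wells, 7 predictors.
rng = np.random.default_rng(0)
n_wells = 21
data = pd.DataFrame(rng.normal(size=(n_wells, 7)),
                    columns=[f"X{i}" for i in range(1, 8)])
data["X2"] = data["X1"] + 0.1 * rng.normal(size=n_wells)  # force one collinear pair
data["Y"] = 2.0 * data["X1"] + rng.normal(size=n_wells)   # hypothetical target

predictors = [c for c in data.columns if c != "Y"]
corr = data.corr()  # pair-wise Pearson correlation coefficients

# 1) Flag pairs of predictors that are strongly linearly correlated.
threshold = 0.7  # arbitrary cut-off, chosen only for this example
flagged_pairs = [(a, b, corr.loc[a, b])
                 for i, a in enumerate(predictors)
                 for b in predictors[i + 1:]
                 if abs(corr.loc[a, b]) > threshold]

# 2) Critical r: the smallest |r| that is significant for n samples
#    (two-tailed Student's t test with n - 2 degrees of freedom).
def critical_r(n, alpha=0.05):
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / np.sqrt(df + t_crit**2)

r_crit = critical_r(n_wells)
fails_critical_r = [p for p in predictors if abs(corr.loc[p, "Y"]) < r_crit]

# 3) Probability of getting |r| at least this large purely by chance
#    (two-tailed p-value under the null hypothesis of zero correlation).
def p_spurious(r, n):
    df = n - 2
    t_stat = abs(r) * np.sqrt(df / (1 - r**2))
    return 2 * stats.t.sf(t_stat, df)

table = pd.DataFrame({"r": corr.loc[predictors, "Y"]})
table["p_spurious"] = [p_spurious(r, n_wells) for r in table["r"]]

print(flagged_pairs)
print(f"critical r for n = {n_wells}: {r_crit:.3f}")
print(fails_critical_r)
print(table)
```

With 21 wells, the critical r at the 5% level (two-tailed) comes out at about 0.43, so in this sketch any predictor whose correlation with the target is weaker than that ends up in `fails_critical_r`. A one-tailed test, or a different significance level, would shift that cut-off.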
And, of course, I will show how to make the prediction.
Have fun reading: get the Jupyter notebook on GitHub.