In my last two posts I published part 1 and part 2 of this Machine Learning quiz. If you have not read them, please do (and cast your votes) before you read part 3 below.
QUIZ, part 3: vote responses and (some) answers
In part 1 I asked which predictions looked “better”: those from model A or those from model B (Figure 1)?

Figure 1
As a reminder, both model A and model B were trained to predict the same labeled facies, picked by a geologist on core, shown in the left column (they are identical) of each model panel. The right column in each panel is the prediction.
The question was asked out of context, with no information given about the training process, any differences in data manipulation (if any), or the model algorithms used. Very unfair, I know! And yet, ~78% of 54 respondents clearly indicated their preference for model A. My sense is that this is because model A looks smoother overall and has fewer of the extra, misclassified thin layers.

Response 1
In part 2, I presented the two predictions, this time accompanied by the confusion matrix for each model (Figure 2).

Figure 2
I asked again which model would be considered better [1] and this was the result:

Response 2a
Although there were far fewer votes (not as robust a statistical sample), the proportion of votes is very similar to that of the previous response, again decidedly in favor of model A. However, the really interesting, and to me surprising, result came from the next answer (Response 2b): about 82% of the 11 respondents believed the performance scores in the confusion matrices to be realistic.

Response 2b
Why was it a surprise? It is now time to reveal the trick…
…which is that the scores in part 2, shown in the confusion matrices of Figure 2, were calculated on the whole well, for training and testing together!!
A few more details:
- I used default parameters for both models
- I used a single 70/30 train/test split (the same random split for both models) with no cross-validation
which is, in essence, how to NOT do Machine Learning!
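
To make the pitfall concrete, here is a minimal sketch of that kind of workflow in scikit-learn. The data is a synthetic stand-in (random features and facies labels), not the actual well logs or the models A and B from the figures, so only the pattern matters: default hyperparameters, one random 70/30 split, and a score computed over the whole dataset instead of over the held-out test samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the log features (X) and the geologist's facies labels (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 9, size=1000)

# A single random 70/30 split, no cross-validation, default hyperparameters
idx = np.arange(len(y))
train_idx, test_idx = train_test_split(idx, test_size=0.3, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X[train_idx], y[train_idx])

# The misleading number: accuracy over training and test samples together,
# where ~70% of the "evaluation" data was already seen (memorized) during training
print("score on the whole dataset:", accuracy_score(y, clf.predict(X)))
```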
In Figure 3, I added a new column to the right of each prediction showing in red which part of the result is merely memorized, and in black which part is interpreted (noise?). Notice that for this particular well (the random 70/30 split was done on all wells pooled together) the proportions come out to 72.5% memorized and 27.5% interpreted.
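
As a side note on those percentages, here is a short continuation of the sketch above, with hypothetical well names attached to the samples. Because the split is drawn over all wells pooled together, any individual well's memorized/interpreted ratio drifts away from the nominal 70/30, just as it did here.

```python
# Hypothetical well labels aligned with the samples; the 70/30 split above ignored them
wells = rng.choice(["WELL_A", "WELL_B", "WELL_C"], size=len(y))

for w in np.unique(wells):
    well_rows = np.flatnonzero(wells == w)                  # sample indices for this well
    frac_memorized = np.isin(well_rows, train_idx).mean()   # fraction seen during training
    print(f"{w}: {frac_memorized:.1%} memorized (train), "
          f"{1 - frac_memorized:.1%} interpreted (test)")
```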
I’ve also added the proper confusion matrix for each model, calculated on the test set only. These are more realistic (and poorer) results.

Figure 3
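
Continuing the same sketch, this is the difference between the two kinds of matrices: the whole-dataset analogue of what I showed in part 2 versus the test-only version added in Figure 3 (again on synthetic stand-in data, so the numbers themselves mean nothing).

```python
from sklearn.metrics import confusion_matrix

# What the matrices in part 2 effectively were: training and test samples together
cm_whole = confusion_matrix(y, clf.predict(X))

# The proper version, as in Figure 3: held-out test samples only
cm_test_only = confusion_matrix(y[test_idx], clf.predict(X[test_idx]))

print("whole dataset:\n", cm_whole)
print("test set only:\n", cm_test_only)
```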
So, going back to that last response: again, with 11 votes I don’t have solid statistics, but with that caveat in mind one might argue that this is a way you could be ‘sold’ unrealistic (as in over-optimistic) ML results.
At the very least, they could be sold by being vague about the details to those not familiar with the task of machine classification of rock facies and its difficulties (see for example this paper for a great discussion of the resolution limitations inherent in using logs (machine) as opposed to core (human geologist)).
Acknowledgments
A big thank you goes to Jesper (Way of the Geophysicist) for his encouragement and feedback, and for brainstorming with me on how to deliver this post series.