Keep advancing your Python coding skills

Featured

October 22, 2020

In my last post I touched on the topic of continuously improving your geo-computing projects (also take a look at my chapter from the upcoming Software Underground book, 52 things you should know about geocomputing).

However, one aspect that I intentionally left out in was that of coding skills as I was planning to get back to it with a dedicated post, which you are reading just now.

2018 vs 2020 comparison of flag percentage calculation

In the Jupyter notebook I compare the results of seismic inversion from two methods (with or without inversion-tailored noise attenuation) using a custom function to flag poor prediction of the target well log using median/median absolute deviation as a statistic for the error; the results are shown below.

One may just do this visual comparison, but I also included calculations to count the number and percentage of samples that have been flagged for each case. Below is a cell of code from the Jupyter notebook (let’s call it 2020 code) that does just that .

zone_errors_a['flagged samples']=result_a.groupby('zone', sort=False).flag.sum().values
zone_errors_b['flagged samples']=result_b.groupby('zone', sort=False).flag.sum().values

def calc_proportion(dtf):
"""
function to calculate proportion of flagged samples
"""
x=dtf.flag
return round(100 * x.sum()/len(x), 1)

zone_errors_a['proportion (%)']=result_a.groupby('zone',sort=False).apply(calc_proportion).values
zone_errors_b['proportion (%)']=result_b.groupby('zone',sort=False).apply(calc_proportion).values

I am a lot happier with this code than with the original code (circa 2018), which is in the cell below.

zones_a=list(result_a['zone'].unique())
zones_b=list(result_b['zone'].unique())

zone_errors_a['flagged samples']=[result_a.loc[result_a.zone==z,'flag'].sum() for z in zones_a]
zone_errors_b['flagged samples']=[result_b.loc[result_b.zone==z,'flag'].sum() for z in zones_b]

zone_errors_a['proportion (%)']=[round(result_a.loc[result_a.zone==z,  'flag'].sum()/len(result_a.loc[result_a.zone==z,'flag'])*100,1) for z in zones_a]                                


zone_errors_b['proportion (%)']=[round(result_b.loc[result_b.zone==z,  'flag'].sum()/len(result_b.loc[result_b.zone==z,'flag'])*100,1) for z in zones_b]                                    

The major differences in the older code are:

  • I was using unique instead of Pandas’ groupby
  • I was using list comprehensions to work through the DataFrame, instead of Pandas’ apply and a custom function to calculate the percentages on the entire DataFrame at once.

I find the 2020 code much more tidy and easier to read.

Enters Pandas for everyone

The above changes happened in but a few hours over two evenings, after having worked through chapters 9 and 10 of Pandas for Everyone by Daniel Chen, a very accessible read for all aspiring data scientists, which I highly recommend (also, watch Daniel’s fully-packed 2019 Pycon tutorial).

And before you ask: no, you do not get the Agile Scientific sticker with the book, I am sorry.

ūüôā

Comparison of 2016 vs 2020 code snippets from the 2016 SEG Machine Learning contest

A second example is of code used to calculate the first and second derivatives for all geophysical logs from the wells in the 2016 SEG Machine Learning contest.

The two cells of code below do exactly the same thing: loop through the wells and for each one in turn loop through the logs, calculate the derivatives, add them to a temporary Pandas DataFrame, then concatenate into a single output DataFrame. In this case, the only difference is the moving away from unique to groupby.

I use the %%timeit cell magic to compare the runtimes for the two cells.

2016 code
%%timeit
# for training data
# calculate all 1st and 2nd derivative for all logs, for all wells
train_deriv_df = pd.DataFrame()             # final dataframe

for well in train_data['Well Name'].unique():        # for each well
    new_df = pd.DataFrame() # make a new temporary dataframe
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']: # for each log
        # calculate and write to temporary dataframe
        new_df[str(log) + '_d1'] = np.array(np.gradient(train_feat_df[log][train_feat_df['Well Name'] == well]))
        new_df[str(log) + '_d2'] = np.array(np.gradient(np.gradient(train_feat_df[log][train_feat_df['Well Name'] == well])))
         
    # append all rows of temporary dataframe to final dataframe          
    train_deriv_df = pd.concat([train_deriv_df, new_df])

86 ms ¬Ī 1.47 ms per loop (mean ¬Ī std. dev. of 7 runs, 10 loops each)
2020 code
%%timeit
# for training data
# calculate all 1st and 2nd derivative for all logs, for all wells
train_deriv_df = pd.DataFrame() # final dataframe

for _, data in train_feat_df.groupby('Well Name'): # for each well        
    new_df = pd.DataFrame()                        # make a new temporary dataframe
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']: # for each log
        # calculate and write to temporary dataframe 
        new_df[str(log) + '_d1'] = np.gradient(data[log])
        new_df[str(log) + '_d2'] = np.gradient(np.gradient(data[log]))

    # append all rows of temporary dataframe to final dataframe          
    train_deriv_df = pd.concat([train_deriv_df, new_df])

52.3 ms ¬Ī 353 ¬Ķs per loop (mean ¬Ī std. dev. of 7 runs, 10 loops each)

We go down to 52.3 ms from 86 ms, which is a modest improvement, but certainly the code is more compact and a whole lot lighter to read (i.e. more pythonic, or pandaish if you prefer): I am happy!

As an aside, if you want to know more about timing code execution, see section 1.07 from Jake VanderPlas’ outstanding Python Data Science Handbook, which I also cannot recommend enough (and do yourself a favor: watch his series Reproducible Data Analysis in Jupyter).

By the way, below I show the notebook code comparison generated using the nbdiff-web option from the awesome nbdime library, a recent discovery.

Geoscience Machine Learning bits and bobs – data completeness

2016 Machine learning contest – Society of Exploration Geophysicists

In a previous post I showed how to use pandas.isnull to find out, for each well individually, if a column has any null values, and sum to get how many, for each column. Here is one of the examples (with more modern, pandaish syntax compared to the example in the previous post:

for well, data in training_data.groupby('Well Name'): 
print(well)
print (data.isnull().values.any())
print (data.isnull().sum(), '\n')

Simple and quick, the output showed met that  Рfor example Рthe well ALEXANDER D is missing 466 samples from the PE log:

ALEXANDER D
True
Facies         0
Formation      0
Well Name      0
Depth          0
GR             0
ILD_log10      0
DeltaPHI       0
PHIND          0
PE           466
NM_M           0
RELPOS         0
dtype: int64

A more appealing and versatile alternative, which I discovered after the contest, comes with the matrix function form the missingno library. With the code below I can turn each well into a Pandas DataFrame on the fly, then a missingno matrix plot.

for well, data in training_data.groupby('Well Name'): 

msno.matrix(data, color=(0., 0., 0.45)) 
fig = plt.gcf()
fig.set_size_inches(20, np.round(len(data)/100)) # heigth of the plot for each well reflects well length 
axes=fig.get_axes()
axes[0].set_title(well, color=(0., 0.8, 0.), fontsize=14, ha='center')

I find that looking at these two plots provides a very compelling and informative way to inspect data completeness, and I am wondering if they couldn’t be used to guide the strategy to deal with missing data, together with domain knowledge from petrophysics.

Interpreting the dendrogram in a top-down fashion, as suggested in the library documentation, my first thoughts are that this may suggest trying to predict missing values in a sequential fashion rather than for all logs at once. For example, looking at the largest cluster on the left, and starting from top right, I am thinking of testing use of GR to first predict missing values in RDEP, then both to predict missing values in RMED, then DTC. Then add CALI and use all logs completed so far to predict RHOB, and so on.

Naturally, this strategy will need to be tested against alternative strategies using lithology prediction accuracy. I would do that in the context of learning curves: I am imagining comparing the training and crossvalidation error first using only non NaN rows, then replace all NANs with mean, then compare separately this sequential log completing strategy with an all-in one strategy.

Upscaling geophysical logs with Python using Pandas and Bruges

With a few hours of work last weekend, I finished putting together a Jupyter notebook tutorial, started at the Geophysics Python sprint 2018, demonstrating how to:

    • Use Agile Scientific’s¬†Welly to load two wells with several geophysical logs
    • Use¬†Pandas,¬†Welly, and NumPy¬†to: remove all logs except for compressional wave velocity (Vp), shear wave velocity (Vs), and density (RHOB); store the wells in individual DataFrames; make the sampling rate common to both wells; check for null values; convert units from imperial to metric; convert slowness to velocity; add a well name column
    • Split the DataFrame by well using unique values in the well name column
    • For each group/well use Agile Scientific’s¬†Bruges¬†‘s Backus¬†average to upscale all curves individually
    • Add the upscaled curves back to the DataFrame

Matt Hall, (organizer), told me during a breakfast chat on the first day of the sprint that this tutorial would be a very good to have since it is one of the most requested examples by neophyte users of the Bruges library; I was happy to oblige.

The code for the most important bit, the last two items in the above list, is included below:

# Define parameters for the Backus filter
lb = 40   # Backus length in meters
dz = 1.0  # Log sampling interval in meters

# Do the upscaling work
wells_bk = pd.DataFrame()
grouped = wells['well'].unique()  
for well in grouped:
    new_df = pd.DataFrame()
    Vp = np.array(wells.loc[wells['well'] == well, 'Vp'])
    Vs = np.array(wells.loc[wells['well'] == well, 'Vs'])
    rhob = np.array(wells.loc[wells['well'] == well, 'RHOB'])
    Vp_bks, Vs_bks, rhob_bks = br.rockphysics.backus(Vp, Vs, rhob, lb, dz)
    new_df['Vp_bk'] = Vp_bks
    new_df['Vs_bk'] = Vs_bks
    new_df['rhob_bk'] = rhob_bks
    wells_bk = pd.concat([wells_bk, new_df])

# Add to the input DataFrame
wells_final = (np.concatenate((wells.values, wells_bk.values), axis=1)) 
cols = list(wells) + list(wells_bk) 
wells_final_df = pd.DataFrame(wells_final, columns=cols)

And here is a plot comparing the raw and upscaled Vp and Vs logs for one of the wells:

backus

Please check the notebook if you want to try the full example.

Geophysics Python sprint 2018 – day 2 and beyond, part II

In the last post I wrote about what Volodymyr and I worked on during a good portion of day two of the sprint in October, and continued to work on upon our return to Calgary.

In addition to that I also continued to work on a notebook example, started in day one, demonstrating on how to upscale sonic and density logs from more than one log at a time using Bruges¬†‘ backusand Panda’s groupby. This will be the focus of a future post.

The final thing I did was to write, and test an error_flag function for Bruges. The function calculates the difference between a predicted and a real curve; it flags errors in prediction if the difference between the curves exceeds a user-defined distance (in standard deviation units) from the mean difference. Another option available is to check whether the curves have opposite slopes (for example one increasing, the other decreasing within a specific interval). The result is a binary error log that can then be used to generate QC plots, to evaluate the performance of the prediction processes in a more (it is my hope) insightful way.

The inspiration for this stems from a discussion over coffee I had 5 or 6 years ago with Glenn Larson, a Geophysicist at Devon Energy, about the limitations of (and alternatives to) using a single global score when evaluating the result of seismic inversion against wireline well logs (the ground truth). I’d been holding that in the back of my mind for years, then finally got to it last Fall.

flag_full

Summary statistics can also be calculated by stratigraphic unit, as demonstrated in the accompanying Jupyter Notebook.

Having fun learning Python with your kids

When I was a kid I loved video games. Now … we’re talking about 70s and 80s games: my first home computer was a Texas Instruments TI-99/4A, my second one a Commodore 64. I loved all games in those days, no matter how primitive they might seem like now; you can get a glimpse of WHAT we are talking about in this video. And funny enough, I still really do love them; below is one of my favorites,¬†H.E.R.O.¬†which I can play these days thanks to a C64 emulator.

HERO

Unfortunately for me, I did not catch (not then, at least) the programming bug. What really turned me down was having tried to type thousands of lines of BASIC to run what looked like an awesome adventure game from a command list in a magazine, only to never be able to play the game because the list was full of typos.

So, I want to make it better for both my daughter, AND myself,¬† because I plan to make more games. First of all, we are working with Python. There are a lot of resources for making games in Python; a really good one is Al Sweigart‘s¬†Invent Your Own Computer Games with Python (as an aside:¬†I have been using a chapter here, a chapter there from many of his books, it is awesome that he made them available online, but it was time to give back so I bought the 4th edition of Invent Your Own Computer Games with Python).

To set the stage for our first game I reviewed the first few chapters of the book, particularly Chapter 5: Dragon Realm, to get some ideas on how to use Python to make interactive games, but then decided to start with our own game: a calculator that would talk to the player, via the terminal, and ask for numbers and for their favorite algebraic operation. However, we added a twist because it sounded a lot more fun: IF her mom were to play, we’d want the results to be occasionally wrong. My daughter worked out some of the logic, after suggesting the twist, and I did the heavy lifting with Python; in the remainder of the post I will work through how we did it. Please note this game is written for Python 3.

In the first block, Jester, the calculator, introduces itself and then asks the player for their name; the name is stored to a variable via the input function and used for greeting and throughout the game. The capitalize() method turns the first letter of the name string to upper case, if needed.

print ('Hello, I am a talking calculator and my name is Jester.') 
name = input('What is your name? ') 
print ('Hello ' + name.capitalize() + ', nice to meet you!')

The real game starts in the big block below:

while True:
  operation = input(name.capitalize() + 
', would you like to add or subtract? (enter "a" or "s"; "q" to quit) ')
  if operation.lower() =='q':
    print('Goodbye ' + name.capitalize())
    break
  elif operation.lower() =='s' or operation.lower() =='a':
    print ('OK, ready.')
    input1 = input('Can I have the first number? ')
    try:
        num1=int(input1)
    except ValueError:
        print ("That is not a number!")
        input1 = input('Can I have the first number? ')
    input2 = input('Can I have the second number? ')
    try:
       num2=int(input2)
    except ValueError:
        print("That is not a number!")
        input2 = input('Can I have the second number? ')
    num1=int(input1)
    num2=int(input2)

Let’s break it up. First, the player chooses an operation (addition or subtraction) again via the input function. The¬†while loop allows the player to continue playing after each operation.

while True:
  operation = input(name.capitalize() + 
', would you like to add or subtract? (enter "a" or "s"; "q" to quit) ')

The following if statement is used to exit the while loop (using break) if the player decides to quit.

  if operation.lower() =='q':
    print('Goodbye ' + name.capitalize())
    break

An¬†elif statement comes next, allowing the player to chose the operation; at the same time if neither ‘a’ nor ‘s’ are chosen, the program will go back to the beginning of the while loop.

  elif operation.lower() =='s' or operation.lower() =='a':
    print ('OK, ready.')

Finally, Jester asks for the first number to be added or subtracted. We want to be sure that the string passed by the user in here is a number, so we use a try¬†clause to handle a possible exception (for example a letter is entered instead of a number). Here is the logic: if when prompted for the first number we typed a letter by accident (for example the string ‘w’, instead of the string ‘3’ , when¬†input1¬† is passed to int we’d get this fatal error and the program would be interrupted:

>>> num1=int(input1)
>>> ValueError: invalid literal for int() with base 10: 'w'

With the¬†try¬†clause we instead handle the exception by printing the custom error message “That is not a number!“, followed by a new request for a number:

    input1 = input('Can I have the first number? ')
    try:
        num1=int(input1)
    except ValueError:
        print ("That is not a number!")
        input1 = input('Can I have the first number? ')
    num1=int(input1)

The above cycle is repeated for the second number. After Jester gets two legitimate integers to work with, the operation part begins, which includes¬†Jester’s prank. Indeed, depending on the name of the player, the result will be either always correct, or sometimes correct, sometimes a bit off (which is done using random). We picked “Olivia” to represent the “lucky” object of the prank. This is the only hard-coded part of the program. It needs to be changed with the name of the intended “victim”.

There are two nested if statements below that allow for the following to happen: if the player’s name is Olivia, a random number between 1 and 3, rand1¬†is drawn.¬†¬†Then the¬† value of¬†rand1 is checked: if¬†if it is 3 then a non-zero number, num3, of value drawn between 1 and 5 (rand2) will be added to the result of the operation to make it wrong; otherwise the operation will execute correctly.

from random import randint
num3 = 0 # initializes random error variable
  if (name.lower() == 'olivia'): # for this player ....
    rnd1 = randint(1,3)
    rnd2 = randint(1,5)
    if rnd1 == 3: # .... if a 3 is drawn the correct result 
      num3 = rnd2 # .... will be modified by this random amount

Finally, this is the addition and subtraction block: in both cases unless “Olivia” is the player num3 will be zero and the result will be correct.

  if operation.lower() == 'a':
    result = num1 + num2 + num3
    print (str(num1) + ' + '  + str(num2) + ' = ' + str(result))
  elif operation.lower() == 's':
    result = (max(num1, num2) - min(num1, num2)) - num3
    print (str(max(num1, num2)) + ' - ' 
         + str(min(num1, num2))  + ' = ' + str(result))

If you want to play the game, please copy the full program below and paste it in a .py file and give it a try. Enjoy!

# NB optimized for Python 3
from random import randint

print ('Hello, I am a talking calculator and my name is Jester.')
name = input('What is your name? ')
print ('Hello ' + name.capitalize() + ', nice to meet you!')

while True:
  operation = input(name.capitalize() +
  ', would you like to add or subtract? (enter "a" or "s"; "q" to quit) ')

  if operation.lower() =='q':
    print('Goodbye ' + name.capitalize())
    break

  elif operation.lower() =='s' or operation.lower() =='a':
    print ('OK, ready.')
    input1 = input('Can I have the first number? ')
    try:
        num1=int(input1)
    except ValueError:
       print ("That is not a number!")
       input1 = input('Can I have the first number? ')

    input2 = input('Can I have the second number? ')
    try:
       num2=int(input2)
    except ValueError:
        print("That is not a number!")
        input2 = input('Can I have the second number? ')

    num1=int(input1)
    num2=int(input2)

  num3 = 0
  if (name.lower() == 'olivia'):
    rnd1 = randint(1,3)
    rnd2 = randint(1,5)
    if rnd1 == 3:
      num3 = rnd2

  if operation.lower() == 'a':
    result = num1 + num2 + num3
    print (str(num1) + ' + '  + str(num2) + ' = ' + str(result))

  elif operation.lower() == 's':
    result = (max(num1, num2) - min(num1, num2)) - num3
    print (str(max(num1, num2)) + ' - '
           + str(min(num1, num2))  + ' = ' + str(result))

 

Geophysics Python sprint 2018 – day 2 and beyond, part I

In my last post I wrote about what I did on day one of the Geophysics sprint run by Agile Scientific in Santa Ana two weeks ago.

This post and the next one are about the project Volodymyr and I worked on during day two of the sprint, and continued to work on upon our return to Calgary.

We had read a great notebook by Alessandro Amato del Monte (I recommend browsing his Geophysical notes repo) showing how to reconstruct a velocity log from density with optimized alpha and beta parameters for the Inverse Gardner function, found via scipy.curve_fit.

Inspired by that, we set out with a dual goal:

  • First, we wanted to adapt Alessandro’s optimization idea so that it would work with¬†Bruges‘ Inverse Gardner
  • Second, we wanted to adapt a function from some old work of mine¬†to flag sections of the output velocity log with poor prediction; this would be useful to learn where¬†alpha and beta may need to be tweaked because of changes in the rock lithology or fluid content

I’ll walk you through some of our work. Below are the two functions:

# Alessandro's simple inverse Gardner
def inv_gardner(rho, alpha, beta):
    return (rho/alpha)**(1/beta)

# Bruges' inverse Gardner
def inverse_gardner(rho, alpha=310, beta=0.25, fps=False):
    """
    Computes Gardner's density prediction from P-wave velocity.
    Args:
        rho (ndarray): Density in kg/m^3.
        alpha (float): The factor, 310 for m/s and 230 for fps.
        beta (float): The exponent, usually 0.25.
        fps (bool): Set to true for FPS and the equation will use the typical
            value for alpha. Overrides value for alpha, so if you want to use
            your own alpha, regardless of units, set this to False.
        Returns:
            ndarray: Vp estimate in m/s.
    """
    alpha = 230 if fps else alpha
    exponent = 1 / beta
    factor = 1 / alpha**exponent
    return factor * rho**exponent

They look similarly structured, and take the same arguments. We can test them by passing a single density value and alpha/beta pair.

inv_gardner(2000, 0.39, 0.23)
>>> 1.349846231542594e+16

inverse_gardner(2000, 0.39, 0.23)
>>> 1.3498462315425942e+16

Good. So the next logical step would be to define some model density and velocity data (shamelessly taken from Alessandro’s notebook, except we now use Bruges’ Gardner with S.I. units) and pass the data, and Bruges’ inverse Gardner toscipy.curve_fit¬†to see if it does just work; could it be that simple?

# Make up random velocity and density with Bruges' direct Gardner
vp_test = numpy.linspace(1500, 5500)
rho_test = gardner(vp_test, 310, 0.25)
noise = numpy.random.uniform(0.1, 0.3, vp_test.shape)*1000
rho_test = rho_test + noise

The next block is only slightly different from Alessandro’s¬†notebook. Instead of using all data, we splits both density and velocity into two pairs of arrays: a rho12 and vp2 to optimize foralpha and beta,¬† a rho1 for calculating “unknown” velocities vp_calc1 further down; the last one,¬†v1, will be used just to show where the real data might have been had we not had to calculate it.

idx = np.arange(len(vp_test))
np.random.seed(3)
spl1 = np.random.randint(0, len(vp_test), 15)
spl2 = np.setxor1d(idx,spl1)
rho1 = rho_test[spl1]
rho2 = rho_test[spl2]
vp1= vp_test[spl1] # this we pretend we do not have 
vp2= vp_test[spl2]

Now, as in Alessandro’s¬†notebook, we pass simple inverse Gardner function to scipy.curve_fit to find optimal alpha and beta parameters, and we printalpha and beta.

popt_synt2, pcov2 = scipy.curve_fit(inv_gardner,rho2, vp2)
print (popt_synt2)
>>> [3.31376056e+02 2.51257203e-01]

Those values seem reasonable, but just to be sure let’s calculate vp_calc1¬†from¬†rho1 and plot everything to be sure.

vp_calc1 = inv_gardner(rho1, *popt_synt2)

# this is to show the fit line
rho_synt_fit=np.linspace(1, 3000, 50)
vp_synt_fit=inv_gardner(rho_synt_fit, *popt_synt2)

plt.figure(figsize=(10, 10))
plt.plot(rho2,vp2,'or', markersize = 10, label = "fitted points")
plt.plot(rho1,vp1,'ob', markersize = 10, alpha = 0.4, label = "calculated points")
plt.plot(rho1,vp1,'ok', markersize = 10, label = "withheld points")
plt.plot(rho_synt_fit, vp_synt_fit, '-r', lw=2, 
         label='Fit' r'$ V_p=(%.2f / \rho)^{1/%.2f}$' %(popt_synt2[0], 
                                                     popt_synt2[1]))
plt.xlabel('Density rho [kg/m^3]'), plt.xlim(1800, 3000)
plt.ylabel('Velocty Vp [m/s]'), plt.ylim(1000, 6000)
plt.grid()
plt.legend(loc='upper left')
plt.show()

plot

That looks great. Let’s now try the same using Bruges’ Inverse Gardner.

popt_synt2, pcov2 = curve_fit(inverse_gardner, rho2, vp2)
print (popt_synt2)
>>> [1.         0.29525991 1.        ]

That is odd, we do not get the same parameters; additionally, there’s this error message:

../scipy/optimize/minpack.py:794:
OptimizeWarning: Covariance of the parameters could not be estimated 
category=OptimizeWarning)

One possible explanation is that although both inv_gardner and inverse_gardner take three parameters, perhaps scipy.curve_fit does not know to expect it because in the latter alpha and betaare pre-assigned.

The workaround for this was to write a wrapper function to ‘map’ between the call signature of¬†scipy.curve_fit¬†and that of inverse_gardner¬†so that it would be ‘communicated’ to the former explicitly.

def optimize_inverse_gardner(rho, alpha, beta):
    return inverse_gardner(rho, alpha=alpha, beta=beta)

popt_synt2, pcov2 = scipy.curve_fit(optimize_inverse_gardner, 
                                    rho2, vp2) 
print (popt_synt2)
>>> [3.31376060e+02 2.51257202e-01]

Which is the result we wanted.

In the next post we will apply this to some real data and show how to flag areas of poorer results.

Geophysics Python sprint 2018 – day 1

Last weekend I went to California to attend my first ever Python sprint, which was organized at MAZ Café con leche (Santa Ana) by Agile Scientific.

For me this event was a success in many respects. First of all, I wanted to spend some dedicated time working on an open source project, rather than chipping away at it once in a while. Also, participating in a project that was not my own seemed like a good way to challenge myself, by pushing me out of a zone of comfort. Finally, this was an opportunity to engage with other members of the Software Underground Slack team, some of which (for example Jesper Dramsch and¬†Brendon Hall) I’ve known for some time but actually never met in person.

Please read about the Sprint in general on Matt Hall‘s blog post,¬†Caf√© con leche.¬†My post is a short summary of what I did on the first day.

After a tasty breakfast, and at least a good hour of socializing, I sat at a table with three other people interested in working on Bruges¬†(Agile’s¬†Python library for Geophysics) : Jesper Dramsch,¬†Adriana Gordon and Volodymyr Vragov.

As I tweeted that evening, we had a light-hearted start, but then we set to work.Screen Shot 2018-10-21 at 11.11.52 AM

While Adriana and Jesper tackled Bruges’ documentation, which was sorely needed, Volodymyr spent some hours on example notebooks from¬†in-Bruges (a tour of Bruges), which needed fixing, and also on setting up our joint project for day 2 (more in the next post). For my part,¬†I¬† put together a tutorial notebooks on how to use Bruges’ functions on wireline logs stored in a Pandas DataFrame. According to Matt, this is requested quite often, so it seemed like a good choice.

Let’s say that a number of wells are stored in a DataFrame with both a depth column, and a well name column, in addition to log curves.

The logic for operating on logs individually is this:
Split the wells in the DataFrame using groupby, then
for each well
for each of the logs of interest
do something using one of Bruges’ functions (for example apply a rolling mean)

The code to do that is surprisingly simple, once you’ve figure it out (I myself struggle often, and not little with Pandas at the outset of new projects).

One has to first create a list with the logs of interest, like so:

logs = ['GR', 'RHOB']

then define the length of the window for the rolling operation:

window = 9

finally, the logic above is applied as:

wells_sm=pd.DataFrame()

grouped=wells['well'].unique()

for well in grouped:    
  new_df=pd.DataFrame()   
    for log in logs:
      sm=br.filters.mean(np.array(wells[log][wells['well']==well]),
                         window)
        new_df[str(log) + '_sm']=sm 
    wells_sm=pd.concat([wells_sm, new_df])

where wells_sm is a temporary DataFrame for the filtered logs, which can be added back to the original DataFrame with:

wells_filtered = (np.concatenate((wells.values, 
                  wells_sm.values), axis=1))
cols = list(wells) + list(wells_sm)
wells_filtered_df = pd.DataFrame(wells_filtered, columns=cols)

You can work through the full example in the notebook.

Machine Learning in Python: classification using Support Vector Machines and Scikit-learn

This post is a short extract, with minor modifications,  from my recently released article on the check the CSEG Recorder Machine Learning in Geoscience V: Introduction to Classification with SVMs.

Understanding classification with Support Vector Machines

Support Vector Machines are a popular type of algorithm used in classification, which is the process of¬†¬†“…identifying to which of a set of¬†categories¬†(sub-populations) a new observation¬†belongs¬†(source: Wikipedia).

In classification, the output variable is a category, for example ‚Äėsand‚Äô, or ‚Äėshale‚Äô, and the main task of the¬†process is the creation of a dividing boundary between the classes. This boundary will be a line in a bi-dimensional space (only two features used to classify), a surface in a three dimensional space (three features), and a hyper-plane in a higher- dimensional space. In this article I will use interchangeably the terms hyper-plane, boundary, and decision surface.

Defining the boundary may sound like a simple task, especially with two features (a bidimensional scatterplot), but it underlines the important concept of generalization, as pointed out by Jake VanderPlas in his¬†Introduction to Scikit-Learn, because ‚ÄĚ… in drawing this separating line, we have learned a model which can generalize to new data: if you were to drop a new point onto the plane which is unlabeled, this algorithm could now predict‚Ķ‚ÄĚ the class it belongs to.

Let’s use a toy classification problem to understand in more detail how in practice SVMs achieve the class separation and find the hyperplane. In the figure below I show an idealized version (with far fewer points) of a Vp/Vs ratio versus P-impedance crossplot from Amato del Monte (2017, Seismic rock physics, tutorial on The Leading Edge).  I’ve added three possible boundaries (dashed lines) separating the two classes.

Each boundary is valid, but are they equally good? Well, for the SVM classifier, they are not because the classifier looks for the boundary with the largest distance from the nearest point in either of the classes.

These points, called Support Vectors, are the most representative of each class, and typically the most difficult to classify. They are also the only ones that matter; if a Support Vector is moved, the boundary will also move. However, if any other point is moved, provided that it is not moved into the margin or across the boundary, it would have no effect on the boundary. This makes SVM classifiers insensitive to outliers (points very far away from the rest of the points in their class and from the boundary) and also less memory intensive than other classifiers (for example, the perceptron). The process of finding this boundary is referred to as ‚Äúmaximizing the margin‚ÄĚ, where the margin is a corridor with no data points between the boundary and the support vectors. The larger this buffer, the lower the generalization error; conversely, small margins are almost invariably associated with over-fitting. We will see more on this in a subsequent section.

So, to go back to the question, which of the three proposed boundaries is the best one (and by ‚Äúbest‚ÄĚ I am referring to the one that will generalize better to unseen data)? Based on what we‚Äôve learned so far, it would have to be the green boundary. Indeed, the orange one is so close to its support vectors (the two points circled with orange) that it leaves virtually no margin; the purple boundary is slightly better (the support vectors are the points circled with purple) but its margin is still quite small compared to the green boundary.

Maximizing the margin is the goal of the SVM classifier, and it is a constrained optimization problem. I refer interested readers to Hearst (1998, Support Vector Machines,¬†IEEE Intelligent Systems); however, I will quote a definition from that paper (with reference to Figure 1 and accompanying text) as it yields further understanding: ‚Äú… the optimal hyper-plane is orthogonal to the shortest line connecting the convex hulls of the two classes, and intersects it half way‚ÄĚ.

In the inset in the figure, I zoomed closer to the 4 points near the green boundary; I’ve also drawn the convex hulls for the classes, the margin, and the shortest orthogonal line, which is bisected by the hyper-plane. I have selected (by hand) the best hyper-plane already (the green one), but if you can imagine rotating a line to span all possible orientations in the empty space close to the two classes without intersecting either of the hulls and find the one with the largest margin, you’ve just done quadratic optimization in your head. Moreover, you’ve turned a crossplot into a decision surface (quoted from Sebastian Thrun,  Intro to Machine Learning, Udacity 120 course).

If you are interested in learning more about Support Vector Machines in an intuitive way, and then how to try classification in practice (using Python and the Scikit-learn library), read the full article here, check the GitHub repo, then read How good is what? (blog post by Evan Bianco of Agile Scientific) for an example and DIY evaluation of  classifier performance.