Pandas groupby | MyCarta

October 22, 2020

In my last post I touched on the topic of continuously improving your geo-computing projects (also take a look at my chapter from the upcoming Software Underground book, 52 things you should know about geocomputing).

However, one aspect that I intentionally left out in was that of coding skills as I was planning to get back to it with a dedicated post, which you are reading just now.

2018 vs 2020 comparison of flag percentage calculation

In the Jupyter notebook I compare the results of seismic inversion from two methods (with or without inversion-tailored noise attenuation) using a custom function to flag poor prediction of the target well log using median/median absolute deviation as a statistic for the error; the results are shown below.

One may just do this visual comparison, but I also included calculations to count the number and percentage of samples that have been flagged for each case. Below is a cell of code from the Jupyter notebook (let’s call it 2020 code) that does just that .

zone_errors_a['flagged samples']=result_a.groupby('zone', sort=False).flag.sum().values
zone_errors_b['flagged samples']=result_b.groupby('zone', sort=False).flag.sum().values

def calc_proportion(dtf):
    """
    function to calculate proportion of flagged samples
    """
    x=dtf.flag
    return round(100 * x.sum()/len(x), 1)

zone_errors_a['proportion (%)']=result_a.groupby('zone',sort=False).apply(calc_proportion).values
zone_errors_b['proportion (%)']=result_b.groupby('zone',sort=False).apply(calc_proportion).values

I am a lot happier with this code than with the original code (circa 2018), which is in the cell below.

zones_a=list(result_a['zone'].unique())
zones_b=list(result_b['zone'].unique())

zone_errors_a['flagged samples']=[result_a.loc[result_a.zone==z,'flag'].sum() for z in zones_a]
zone_errors_b['flagged samples']=[result_b.loc[result_b.zone==z,'flag'].sum() for z in zones_b]

zone_errors_a['proportion (%)']=[round(result_a.loc[result_a.zone==z,  'flag'].sum()/len(result_a.loc[result_a.zone==z,'flag'])*100,1) for z in zones_a]                                


zone_errors_b['proportion (%)']=[round(result_b.loc[result_b.zone==z,  'flag'].sum()/len(result_b.loc[result_b.zone==z,'flag'])*100,1) for z in zones_b]

The major differences in the older code are:

I was using unique instead of Pandas’ groupby
I was using list comprehensions to work through the DataFrame, instead of Pandas’ apply and a custom function to calculate the percentages on the entire DataFrame at once.

I find the 2020 code much more tidy and easier to read.

Enters Pandas for everyone

The above changes happened in but a few hours over two evenings, after having worked through chapters 9 and 10 of Pandas for Everyone by Daniel Chen, a very accessible read for all aspiring data scientists, which I highly recommend (also, watch Daniel’s fully-packed 2019 Pycon tutorial).

And before you ask: no, you do not get the Agile Scientific sticker with the book, I am sorry.

🙂

Comparison of 2016 vs 2020 code snippets from the 2016 SEG Machine Learning contest

A second example is of code used to calculate the first and second derivatives for all geophysical logs from the wells in the 2016 SEG Machine Learning contest.

The two cells of code below do exactly the same thing: loop through the wells and for each one in turn loop through the logs, calculate the derivatives, add them to a temporary Pandas DataFrame, then concatenate into a single output DataFrame. In this case, the only difference is the moving away from unique to groupby.

I use the %%timeit cell magic to compare the runtimes for the two cells.

2016 code

%%timeit
# for training data
# calculate all 1st and 2nd derivative for all logs, for all wells
train_deriv_df = pd.DataFrame()             # final dataframe

for well in train_data['Well Name'].unique():        # for each well
    new_df = pd.DataFrame() # make a new temporary dataframe
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']: # for each log
        # calculate and write to temporary dataframe
        new_df[str(log) + '_d1'] = np.array(np.gradient(train_feat_df[log][train_feat_df['Well Name'] == well]))
        new_df[str(log) + '_d2'] = np.array(np.gradient(np.gradient(train_feat_df[log][train_feat_df['Well Name'] == well])))
         
    # append all rows of temporary dataframe to final dataframe          
    train_deriv_df = pd.concat([train_deriv_df, new_df])

86 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2020 code

%%timeit
# for training data
# calculate all 1st and 2nd derivative for all logs, for all wells
train_deriv_df = pd.DataFrame() # final dataframe

for _, data in train_feat_df.groupby('Well Name'): # for each well        
    new_df = pd.DataFrame()                        # make a new temporary dataframe
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']: # for each log
        # calculate and write to temporary dataframe 
        new_df[str(log) + '_d1'] = np.gradient(data[log])
        new_df[str(log) + '_d2'] = np.gradient(np.gradient(data[log]))

    # append all rows of temporary dataframe to final dataframe          
    train_deriv_df = pd.concat([train_deriv_df, new_df])

52.3 ms ± 353 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We go down to 52.3 ms from 86 ms, which is a modest improvement, but certainly the code is more compact and a whole lot lighter to read (i.e. more pythonic, or pandaish if you prefer): I am happy!

As an aside, if you want to know more about timing code execution, see section 1.07 from Jake VanderPlas’ outstanding Python Data Science Handbook, which I also cannot recommend enough (and do yourself a favor: watch his series Reproducible Data Analysis in Jupyter).

By the way, below I show the notebook code comparison generated using the nbdiff-web option from the awesome nbdime library, a recent discovery.

With a few hours of work last weekend, I finished putting together a Jupyter notebook tutorial, started at the Geophysics Python sprint 2018, demonstrating how to:

- Use Agile Scientific’s Welly to load two wells with several geophysical logs
- Use Pandas, Welly, and NumPy to: remove all logs except for compressional wave velocity (Vp), shear wave velocity (Vs), and density (RHOB); store the wells in individual DataFrames; make the sampling rate common to both wells; check for null values; convert units from imperial to metric; convert slowness to velocity; add a well name column
- Split the DataFrame by well using unique values in the well name column
- For each group/well use Agile Scientific’s Bruges ‘s Backus average to upscale all curves individually
- Add the upscaled curves back to the DataFrame

Matt Hall, (organizer), told me during a breakfast chat on the first day of the sprint that this tutorial would be a very good to have since it is one of the most requested examples by neophyte users of the Bruges library; I was happy to oblige.

The code for the most important bit, the last two items in the above list, is included below:

# Define parameters for the Backus filter
lb = 40   # Backus length in meters
dz = 1.0  # Log sampling interval in meters

# Do the upscaling work
wells_bk = pd.DataFrame()
grouped = wells['well'].unique()  
for well in grouped:
    new_df = pd.DataFrame()
    Vp = np.array(wells.loc[wells['well'] == well, 'Vp'])
    Vs = np.array(wells.loc[wells['well'] == well, 'Vs'])
    rhob = np.array(wells.loc[wells['well'] == well, 'RHOB'])
    Vp_bks, Vs_bks, rhob_bks = br.rockphysics.backus(Vp, Vs, rhob, lb, dz)
    new_df['Vp_bk'] = Vp_bks
    new_df['Vs_bk'] = Vs_bks
    new_df['rhob_bk'] = rhob_bks
    wells_bk = pd.concat([wells_bk, new_df])

# Add to the input DataFrame
wells_final = (np.concatenate((wells.values, wells_bk.values), axis=1)) 
cols = list(wells) + list(wells_bk) 
wells_final_df = pd.DataFrame(wells_final, columns=cols)

And here is a plot comparing the raw and upscaled Vp and Vs logs for one of the wells:

Please check the notebook if you want to try the full example.

MyCarta

A blog about Geoscience, Visualization, Data Science, AI

Tag Archives: Pandas groupby

Keep advancing your Python coding skills

2018 vs 2020 comparison of flag percentage calculation

Enters Pandas for everyone

Comparison of 2016 vs 2020 code snippets from the 2016 SEG Machine Learning contest

2016 code

2020 code

Like this:

Upscaling geophysical logs with Python using Pandas and Bruges

Like this:

2018 vs 2020 comparison of flag percentage calculation

Enters Pandas for everyone

Comparison of 2016 vs 2020 code snippets from the 2016 SEG Machine Learning contest

2016 code

2020 code

Share this:

Like this:

Share this:

Like this: