group_by() %>% mutate() using pandas

While I have my issues with the tidyverse, one feature I am enamored with is the ability to assign values to observations in grouped data without aggregating the data. This is done by using mutate() instead of summarize(). I am in the middle of some data processing in a Python pipeline where I need to do the same thing, and it took me much longer to get working than I thought it would.
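For readers who do not use dplyr, here is a minimal sketch of the distinction in pandas terms, on a toy frame. The names `df`, `group`, `value`, and `group_mean` are made up for illustration. It only covers the simple case where the per-group computation is a built-in aggregate.

```python
import pandas as pd

# Toy data; the column names are invented for this example.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 3, 10, 20, 30],
})

# summarize()-style: the result has one row per group.
summary = df.groupby("group")["value"].mean()

# mutate()-style: the result has one value per original row,
# aligned with df's index, so it can be assigned as a new column.
df["group_mean"] = df.groupby("group")["value"].transform("mean")
```

The first result collapses the five rows to two, while the second keeps all five rows and repeats each group's mean on every row of that group.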

I am, of course, not the first person to try this operation. Googling phrases such as “pandas equivalent of dplyr mutate”, “pandas groupby apply examples”, and “pandas groupby list comprehension” did not help. The sixth result for the query “pandas custom function to apply” got me to a solution, and it ended up being as easy as I hoped it would be. For some reason, the answers to the earlier queries were convoluted or not quite right; lambda functions, transform(), etc. were all less user-friendly than I needed.

The solution is to pass a function, custom or not, to the apply() call after groupby(). The function receives each group in turn as a DataFrame. It should create the new column you want and return the group. apply() will forward extra arguments to the function, but that would add a wrinkle I do not want to deal with. Instead, I hardcode the column to analyze. My solution:

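from scipy import stats
def getPercentile(data):
    # Work on a copy so the group handed in by pandas is not modified in place.
    data = data.copy()
    followers = data['user.followers_count']
    # Percentile rank of each follower count within its own group.
    data['temp'] = [stats.percentileofscore(followers, item) for item in followers]
    return data

# group_keys=False keeps the original row index instead of adding the group keys to it.
hooray = data.groupby(['place.country_code', 'yearmonth'], group_keys=False).apply(getPercentile)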
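As a self-contained variation on the same pattern, the sketch below passes the column name as an argument instead of hardcoding it. The frame `toy`, the function `add_percentile`, and the column names are hypothetical. To avoid the scipy dependency it swaps in pandas' own rank(pct=True), which I would expect to agree with percentileofscore's default settings but have not checked against the data above.

```python
import pandas as pd

def add_percentile(group, column, new_column="pct"):
    """Return a copy of one group with a within-group percentile column."""
    group = group.copy()  # avoid mutating the frame pandas passes in
    # rank(pct=True) is the average rank divided by the group size.
    group[new_column] = group[column].rank(pct=True) * 100
    return group

# Toy data; names and values are invented for illustration.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "CA", "CA"],
    "followers": [10, 30, 20, 5, 50],
})

# Arguments given after the function are forwarded to it by apply().
result = toy.groupby("country", group_keys=False).apply(add_percentile, "followers")

# The same values without apply(), using the built-in grouped rank.
toy["pct_direct"] = toy.groupby("country")["followers"].rank(pct=True) * 100
```

Depending on the pandas version, apply() may warn about, or leave out, the grouping columns in the frame it passes to the function. The grouped rank() on the last line sidesteps that because it only ever touches the one selected column.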

At the end of the day, the Python + pandas solution is as easy as R + dplyr.  For some reason, figuring out the solution was hard.
