
I have a pandas dataframe of vehicle co-ordinates (from multiple vehicles on multiple days). For each vehicle and for each day, I do one of two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.

To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions that take in a dataframe.

I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written out in a declarative style, as opposed to imperatively looping through the groups, with the goal that the whole thing looks something like:

df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)

Of course, the above code doesn't work as written, since .apply() and .filter() return new dataframes, and this is exactly my problem. They return all the data back in a single dataframe, and I find that I have to apply .groupby(['vehicle_id', 'day']) again after every step.
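
To make the repetition concrete, here is roughly what the code ends up looking like (algorithm1, condition1, etc. are my own functions):

step1 = df.groupby(['vehicle_id', 'day']).apply(algorithm1)
step2 = step1.groupby(['vehicle_id', 'day']).filter(condition1)
step3 = step2.groupby(['vehicle_id', 'day']).apply(algorithm2)
result = step3.groupby(['vehicle_id', 'day']).filter(condition2)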

Is there a nice way that I can write this out without having to group by the same columns over and over?

mchristos

1 Answer


Since groupby.apply iterates over the groups with a plain Python for loop anyway (meaning there are no sophisticated optimizations happening in the background), I suggest using an actual for loop:

import pandas as pd

arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = do_stuff1(dfg)  # perform all needed operations on the group,
    dfg = do_stuff2(dfg)  # one step after another
    arr.append(dfg)

result = pd.concat(arr)

An alternative is to create a function which runs all of the applies and filters sequentially on a single group's dataframe, and then make one groupby/apply call with it:

def all_operations(dfg):
    # Run every apply/filter step on this group's dataframe
    # and return the processed result
    return result_df

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
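
For instance, assuming algorithm1/algorithm2 transform a group's dataframe and condition1/condition2 return a boolean for a group (as GroupBy.filter expects), all_operations might look something like this sketch; groups for which the function returns None are left out of the combined result:

def all_operations(dfg):
    dfg = algorithm1(dfg)
    if not condition1(dfg):  # group fails the first filter
        return None
    dfg = algorithm2(dfg)
    if not condition2(dfg):  # group fails the second filter
        return None
    return dfg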

In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
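
For the loop version, for example, one way is to skip empty frames before concatenating (a sketch; pd.concat raises an error when given an empty list, hence the fallback):

non_empty = [d for d in arr if not d.empty]  # drop groups that were filtered away entirely
result = pd.concat(non_empty) if non_empty else df.iloc[:0]  # fall back to an empty frame with the original columns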

Shovalt