I am seeing some really weird behavior with pool.starmap inside a groupby apply function. Without getting into the specifics, my code is something like this:
def groupby_apply_func(input_df):
    print('Processing group # x (I determine x from the df)')
    output_df = pool.starmap(another_func, zip(some fields in input_df))
    return output_df

result = some_df.groupby(groupby_fields).apply(groupby_apply_func)
In words: this takes a dataframe, groups it, and sends each group to groupby_apply_func, which does some processing in parallel using starmap and returns the results, which are concatenated into a final df. pool is a worker pool created with the multiprocessing library.
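For reference, a minimal runnable version of this pattern might look like the sketch below. It is not my real code: the column names (key, x, y), the body of another_func, the helper names run_groupby and make_example_df, and the pool size are all placeholders chosen for illustration.

```python
import multiprocessing as mp

import pandas as pd


def another_func(x, y):
    # Placeholder for the real per-row work.
    return x + y


def run_groupby(some_df, pool):
    """Group some_df by 'key' and fan each group's rows out to the pool."""

    def groupby_apply_func(input_df):
        # pandas normally exposes the group key as .name inside apply;
        # fall back to "?" if it is not available.
        print("Processing group", getattr(input_df, "name", "?"))
        # starmap blocks until every row of this group has been processed.
        values = pool.starmap(another_func, zip(input_df["x"], input_df["y"]))
        return pd.DataFrame({"total": values}, index=input_df.index)

    # Selecting the value columns keeps the grouping column out of input_df.
    return some_df.groupby("key")[["x", "y"]].apply(groupby_apply_func)


def make_example_df():
    # Tiny hypothetical dataset: two groups of two rows each.
    return pd.DataFrame(
        {"key": ["a", "a", "b", "b"], "x": [1, 2, 3, 4], "y": [10, 20, 30, 40]}
    )


if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        result = run_groupby(make_example_df(), pool)
    print(result)
```

Note that starmap itself blocks until the whole group is done, so groups are processed one after another and only the rows within a group run in parallel.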
This code works for smaller datasets without problem, so there are no syntax errors or anything like that. It loops through all of the groups formed by groupby, sends them to groupby_apply_func (I can see the progress from the print statement), and comes back fine.
The weird behavior is this: on large datasets, it starts looping through the groups. Then, halfway or three quarters of the way through (which in real time might be 12 hours), it starts completely over at the beginning of the groups! It resets the loop and begins again. Then, sometimes, the second pass resets as well, and so on, and it gets stuck in an infinite loop. Again, this only happens with large datasets; it works as intended on small ones.
Could there be something in the apply functionality that, upon running out of memory for example, decides to start re-processing all the groups? It seems unlikely to me, but I did read that apply will actually process the first group multiple times in order to optimize code paths, so I know there is some "meta" logic in there to handle the processing, and it's not just a straight loop.
Hope all that made sense. Does anyone know the inner workings of groupby.apply, and if so, could anything in there be causing this?
thx
EDIT: It appears to reset the loop at the point in ops.py shown below. It gets to this except clause and then proceeds to line 195, which is "for key, (i, group) in zip(group_keys, splitter):", and that starts the entire loop over again. Does this mean anything to anybody?
except libreduction.InvalidApply as err:
    # This Exception is raised if `f` triggers an exception
    # but it is preferable to raise the exception in Python.
    if "Let this error raise above us" not in str(err):
        # TODO: can we infer anything about whether this is
        # worth-retrying in pure-python?
        raise
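If I am reading this right, the control flow is "try a fast path over all groups, and if my function raises inside it, swallow the error and run the plain Python loop from the first group". Here is a toy model of that flow, just to show why a single failure late in the run would look like the loop starting over. This is not pandas' actual code: FastPathError, fast_apply, apply_with_fallback, and flaky_func are names I made up, and the "fails once on group c" behavior is an assumption standing in for whatever goes wrong on my large dataset.

```python
class FastPathError(Exception):
    """Made-up stand-in for the exception caught in the except clause above."""


def fast_apply(func, groups):
    # Toy fast path: any error raised by func is wrapped and re-raised.
    results = []
    for group in groups:
        try:
            results.append(func(group))
        except Exception as err:
            raise FastPathError(str(err)) from err
    return results


def apply_with_fallback(func, groups):
    try:
        return fast_apply(func, groups)
    except FastPathError:
        # The partial results are thrown away and we fall through.
        pass
    # Slow path: a plain loop that starts again at the first group.
    return [func(group) for group in groups]


calls = []        # every group func was called with, in order
failed_once = set()


def flaky_func(group):
    calls.append(group)
    # Assumption for the demo: group "c" fails the first time it is seen.
    if group == "c" and group not in failed_once:
        failed_once.add(group)
        raise MemoryError("simulated transient failure")
    return group.upper()


results = apply_with_fallback(flaky_func, ["a", "b", "c"])
print(calls)    # ['a', 'b', 'c', 'a', 'b', 'c']
print(results)  # ['A', 'B', 'C']
```

In this model the function is called for groups a and b twice, which matches what I see in my print output when the loop "resets". It would not by itself explain an infinite loop, though, since the fallback only happens once per apply call.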