
I am experiencing some really weird behavior with pool.starmap in the context of a groupby apply function. Without getting into the specifics, my code is something like this:

def groupby_apply_func(input_df):
    print('Processing group # x (I determine x from the df)')
    # 'col_a' / 'col_b' are illustrative stand-ins for the real fields
    output_df = pool.starmap(another_func, zip(input_df['col_a'], input_df['col_b']))
    return output_df

result = some_df.groupby(groupby_fields).apply(groupby_apply_func)

In words: this takes a dataframe, groups it, and sends each group to groupby_apply_func, which farms some processing out to the worker pool via starmap and returns the results, which are then concatenated into a final df. pool is a worker pool created with the multiprocessing library.
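For concreteness, a minimal self-contained version of the pattern looks like this (the column and function names are placeholders, not my real code):

import multiprocessing as mp
import pandas as pd

def another_func(a, b):
    # placeholder for the real per-row work
    return a + b

def groupby_apply_func(input_df):
    # fan the rows of this group out to the worker pool
    results = pool.starmap(another_func, zip(input_df['col_a'], input_df['col_b']))
    return pd.Series(results, index=input_df.index)

if __name__ == '__main__':
    pool = mp.Pool(2)
    some_df = pd.DataFrame({'g': [1, 1, 2], 'col_a': [1, 2, 3], 'col_b': [4, 5, 6]})
    result = some_df.groupby('g').apply(groupby_apply_func)
    pool.close()
    pool.join()
    print(result)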

This code works without problem on smaller datasets, so there are no syntax errors or anything like that. It loops through all of the groups formed by the groupby, sends them to groupby_apply_func (I can watch the progress via the print statement), and finishes fine.

The weird behavior is: on large datasets, it starts looping through the groups. Then, halfway or three-quarters of the way through (which in real time might be 12 hours in), it starts completely over at the first group! Then, sometimes, the second pass resets too, and so on, until it is stuck in an infinite loop. Again, this only happens with large datasets; it works as intended on small ones.

Could there be something in the apply functionality that, upon running out of memory for example, decides to re-process all the groups? That seems unlikely to me, but I have read that apply will actually process the first group multiple times in order to choose between code paths, so I know there is "meta" functionality in there - some logic around the processing - and it's not just a straight loop.
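You can see that first-group behavior in isolation with a tiny example (this is on pandas around the 1.x era; I understand later versions changed this):

import pandas as pd

df = pd.DataFrame({'g': [1, 1, 2, 2], 'v': [1, 2, 3, 4]})

def f(group):
    print('applying to group', group['g'].iloc[0])
    return group['v'].sum()

df.groupby('g').apply(f)
# On pandas ~1.x this prints "applying to group 1" twice: apply
# evaluates the first group an extra time to choose a code path.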

Hope all that made sense. Does anyone know the inner workings of groupby.apply and, if so, whether anything in there could be causing this?

thx

EDIT: It appears to reset the loop at this point in pandas' ops.py ... it reaches this except clause and then proceeds to line 195, which is for key, (i, group) in zip(group_keys, splitter): - the line that starts the entire loop over again. Does this mean anything to anybody?

        except libreduction.InvalidApply as err:
            # This Exception is raised if `f` triggers an exception
            # but it is preferable to raise the exception in Python.
            if "Let this error raise above us" not in str(err):
                # TODO: can we infer anything about whether this is
                #  worth-retrying in pure-python?
                raise
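For context, here is my heavily simplified paraphrase of the control flow around that except clause (based on reading the pandas 1.x source; names and details are elided, this is not the exact code):

try:
    # Cython "fast apply" path: applies f to every group in one optimized pass
    result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
except libreduction.InvalidApply:
    # f raised inside the fast path (the clause quoted above); if the
    # exception is not re-raised, control falls through to the loop below
    pass
else:
    if len(result_values) == len(group_keys):
        return group_keys, result_values, mutated

# the pure-Python fallback ("line 195"): note it starts over from the
# FIRST group, repeating all the work already done by the fast path
for key, (i, group) in zip(group_keys, splitter):
    ...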
trance_dude

2 Answers


I would pass a list of the group dataframes as the argument to map (I don't think you need starmap here), rather than hiding the multiprocessing inside the function being applied.

import multiprocessing as mp

def func(df):
    # do something with one group, e.g. apply func2 row-wise
    return df.apply(func2)

with mp.Pool(mp.cpu_count()) as p:
    groupby = some_df.groupby(groupby_fields)
    groups = [groupby.get_group(group) for group in groupby.groups]
    result = p.map(func, groups)
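If you want a single dataframe back (as groupby.apply would give you), concat the mapped results inside the with block; a sketch, assuming func returns a dataframe per group:

import pandas as pd

with mp.Pool(mp.cpu_count()) as p:
    result = pd.concat(p.map(func, groups))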

Eric Truett
  • So yeah, I originally had it this way, but there are design consequences, one of which is that func then needs to be top-level or multiprocessing cannot pickle it. Then you end up needing global objects or otherwise passing a lot of info up to the top level. So it's much more convenient in the form I have it in, and there's no reason it shouldn't work... and it does work, usually - just not for large datasets – trance_dude Feb 06 '21 at 14:58
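One common way around the pickling constraint mentioned in this comment is to keep the worker function at module top level and bind any extra state with functools.partial instead of using globals. A sketch, with illustrative names:

import functools
import multiprocessing as mp
import pandas as pd

def func(config, df):
    # top-level, so multiprocessing can pickle it; extra state arrives via partial
    return df.assign(scaled=df['value'] * config['scale'])

if __name__ == '__main__':
    some_df = pd.DataFrame({'g': [1, 1, 2], 'value': [1.0, 2.0, 3.0]})
    groups = [g for _, g in some_df.groupby('g')]
    with mp.Pool(2) as p:
        result = p.map(functools.partial(func, {'scale': 2}), groups)
    print(pd.concat(result))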

OK so I figured it out. It doesn't have anything to do with starmap; it is due to the groupby apply function. That function tries a "fast apply" over the groups before running "normal" apply. If anything raises an error inside that fast_apply loop (in my case it was an out-of-memory error), it catches the error without printing it and re-runs everything using "normal" apply.
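One way to at least see the error that this fallback hides is to wrap the applied function so it logs before re-raising. A sketch:

import traceback

def logged(f):
    def wrapper(group):
        try:
            return f(group)
        except Exception:
            # print the error that the fast-apply fallback would swallow
            traceback.print_exc()
            raise
    return wrapper

result = some_df.groupby(groupby_fields).apply(logged(groupby_apply_func))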

Not sure if any Python people will read this but I'd humbly suggest that:

  • if an error really occurs in the fast_apply loop, maybe print it out rather than silently catching everything - it would make debugging issues like this much easier

  • the logic that re-runs the entire loop if fast_apply fails seems a little weird to me. It's probably not a big deal for small apply operations, but mine was huge and I really don't want it re-running the whole thing. Perhaps give the user an option to opt out of the fast_apply optimization entirely? I don't know its inner workings and I'm sure it's there for a good reason, but it adds complexity, and in my case it created a very confusing situation that took hours to figure out. (In the meantime you can sidestep it by looping over the groups yourself - see the sketch below.)
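A workaround available today: skip groupby.apply (and with it the fast_apply machinery) by iterating the groups explicitly. A sketch, assuming the applied function returns a dataframe or series per group:

import pandas as pd

pieces = []
for key, group in some_df.groupby(groupby_fields):
    # calling the function ourselves means pandas never enters fast_apply,
    # so nothing gets silently retried from the beginning
    pieces.append(groupby_apply_func(group))
result = pd.concat(pieces)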

trance_dude