multiprocessing a geopandas.overlay() throws no error but seemingly never completes

Question

I'm trying to run geopandas.overlay() through multiprocessing to speed it up. I use custom functions with functools.partial to pre-fill the fixed arguments, then map the remaining argument over an iterable to produce a series of dataframes that I concatenate into one.

import functools
from multiprocessing import Pool, cpu_count

import pandas as pd

def taska(id, points, crs):
    # make_break_points is a custom function defined elsewhere
    return make_break_points((points[points.ID == id]).reset_index(drop=True), crs)

# points_gdf: GeoDataFrame of points with an ID field
# grid_gdf: GeoDataFrame polygon grid
partialA = functools.partial(taska, points=points_gdf, crs=grid_gdf.crs)
partialA_results = []
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialA, list(points_gdf.ID.unique())):
        partialA_results.append(results)
bpts_gdf = pd.concat(partialA_results)

In the example above, I use the list of unique ID values to subset the df, and each subset is handled by a worker process that runs the function and returns the result. At the end, all the results are combined using pd.concat.

When I apply the same approach to a list of dataframes created using numpy.array_split(), the worker processes start, then they all close, and the script hangs with no indication that work is being done or that it will ever exit.

import geopandas as gpd
import numpy as np

def taskc(tracks, grid):
    return gpd.overlay(tracks, grid, how='union').explode().reset_index(drop=True)

# tracks_gdf: GeoDataFrame of points with an id field
# grid_gdf: GeoDataFrame polygon grid
dfs = np.array_split(tracks_gdf, cpu_count() - 4)
partialC_results = []
partialC = functools.partial(taskc, grid=grid_gdf)
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfs):
        partialC_results.append(results)
results_df = pd.concat(partialC_results)

Based on the information at https://pythonspeed.com/articles/python-multiprocessing/, I also tried creating the pool with get_context('spawn').Pool(cpu_count() - 4), with no change in behavior. Additionally, if I simply run geopandas.overlay(tracks_gdf, grid_gdf) without multiprocessing, it succeeds and the script carries on to the end with the expected results.

Why does the partial function approach work on a list of items but not on a list of dataframes? Is the output of numpy.array_split() not an iterable object like a list? How can I pass a single df into geopandas.overlay() in chunks to utilize multiprocessing and get back a single dataframe, or a series of dataframes to concat?
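On the iterable question: numpy.array_split() returns an ordinary Python list of chunks, so pool.map can iterate over it like any other list. The pure-Python sketch below imitates its sizing rule (the first len % n chunks get one extra item) on a plain list. The name split_like_array_split is hypothetical and used only for illustration; it is not part of numpy.

```python
def split_like_array_split(items, n_sections):
    """Split a sequence into n_sections contiguous chunks.

    Mirrors the sizing rule of numpy.array_split: the first
    len(items) % n_sections chunks hold one extra item.
    (Hypothetical helper, for illustration only.)
    """
    base, extra = divmod(len(items), n_sections)
    chunks = []
    start = 0
    for i in range(n_sections):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


chunks = split_like_array_split(list(range(10)), 3)
print(type(chunks).__name__, chunks)
# list [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Since the result is a list either way, the type of the iterable by itself does not explain the difference in behavior between the two approaches.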

score 0 · Accepted Answer · answered Dec 16 '22 at 16:02

This is my workaround, but I am also interested in whether there is a better way to perform this and similar tasks. Essentially, I moved the df split into the task function, so each worker takes its own chunk by index, and I pass a list of values from range() as the iterable.

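from multiprocessing import get_context

def taskc(num, tracks, grid):
    # Split inside the worker and keep only chunk number num
    chunk = np.array_split(tracks, cpu_count() - 4)[num]
    return gpd.overlay(chunk, grid, how='union').explode().reset_index(drop=True)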

partialC = functools.partial(taskc, tracks=tracks_gdf, grid=grid_gdf)
dfrange = list(range(0, cpu_count() - 4))
partialC_results = []
with get_context('spawn').Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfrange):
        partialC_results.append(results)
results_gdf = pd.concat(partialC_results)
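One possible variation, offered as a sketch rather than a tested fix: instead of binding the whole frame into the partial, hand it to each worker once through the pool's initializer and send only (start, stop) row bounds as tasks. In the sketch below a plain list stands in for tracks_gdf and doubling each value stands in for gpd.overlay(); the names init_worker, make_tasks, process_chunk and run_parallel are hypothetical. In the real case the grid could be passed through initargs the same way, and the slice would be taken with .iloc[start:stop].

```python
from multiprocessing import get_context

_TRACKS = None  # per-worker copy of the input, filled in by init_worker


def init_worker(tracks):
    """Runs once in each worker process and stores the shared input."""
    global _TRACKS
    _TRACKS = tracks


def make_tasks(n_rows, n_chunks):
    """Return contiguous (start, stop) pairs covering range(n_rows)."""
    size = max(1, -(-n_rows // n_chunks))  # ceiling division
    return [(start, min(start + size, n_rows))
            for start in range(0, n_rows, size)]


def process_chunk(bounds):
    """Process one slice of the shared input."""
    start, stop = bounds
    # Stand-in for the real work, e.g.
    # gpd.overlay(_TRACKS.iloc[start:stop], grid, how="union")
    return [value * 2 for value in _TRACKS[start:stop]]


def run_parallel(tracks, n_workers):
    """Map process_chunk over row bounds using a spawn-based pool."""
    tasks = make_tasks(len(tracks), n_workers)
    ctx = get_context("spawn")
    with ctx.Pool(n_workers, initializer=init_worker,
                  initargs=(tracks,)) as pool:
        chunks = pool.map(process_chunk, tasks)  # map already returns a list
    # Flatten; with GeoDataFrames this would be pd.concat(chunks)
    return [value for chunk in chunks for value in chunk]
```

With the spawn start method, each worker re-imports the main module, so a call such as run_parallel(list(range(10)), 3) should sit inside an if __name__ == "__main__": block in the script. Because pool.map returns a list, the append loop in the workaround is optional.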
