9

I have two sets of shapefiles containing polygons. One set holds the US counties I'm interested in, which vary across firms and years. The other set holds the business areas of firms, which also vary across firms and years. I need the intersection of these two layers for each firm in each year. So far, overlay(df1, df2, how = 'intersection') does what I need, but it takes around 300s for each firm-year. Given that I have a long list of firms and many years, this would take days to finish. Is there any way to improve the performance?

I notice that if I do the same thing in ArcGIS, the 300s comes down to a few seconds. But I'm new to ArcGIS and not yet familiar with its Python scripting.

Crystie

3 Answers

6

If you look at the current geopandas overlay source code, you'll see that the overlay function has already been updated to use R-tree spatial indexing. I don't think building the R-tree manually would be any faster at this point (it would probably be slower).

See source code here: https://github.com/geopandas/geopandas/blob/master/geopandas/tools/overlay.py

pasalacquaian
  • Is that in geopandas 0.6.2? It's not mentioned here: https://github.com/geopandas/geopandas/releases – Mike Honey Feb 02 '20 at 11:56
  • It was updated in release 0.4.0. So, it was done a while ago ~2018 – pasalacquaian Feb 05 '20 at 19:06
  • I have been trying to use multiprocessing and numpy to split one of the input dataframes and pass each piece to its own worker, but for very large datasets it never seems to finish. – MrKingsley Mar 22 '23 at 12:15
3

Hopefully you've figured this out by now, but the solution is to use GeoPandas' R-tree spatial index. You can achieve orders-of-magnitude improvement by implementing it appropriately.

Geoff Boeing has written an excellent tutorial.

http://geoffboeing.com/2016/10/r-tree-spatial-index-python/
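As a rough illustration of the idea (not code from the tutorial), here is a minimal sketch of a spatial-index prefilter. It assumes one business-area polygon per firm-year and a GeoDataFrame of counties; the function name `intersect_with_prefilter` and the toy data are hypothetical. Note that, per the other answer, recent geopandas versions of overlay already use the spatial index internally, so a manual prefilter like this may not help much there.

```python
import geopandas as gpd
from shapely.geometry import box


def intersect_with_prefilter(counties, area):
    """Intersect county polygons with a single business-area polygon.

    Hypothetical helper: the spatial index first narrows the counties to
    those whose bounding boxes overlap the area's bounding box, and only
    those candidates get the exact (expensive) geometric test.
    """
    # Positional indices of counties whose bounding box overlaps the area's
    candidate_idx = sorted(counties.sindex.intersection(area.bounds))
    candidates = counties.iloc[candidate_idx]
    # Exact test on the much smaller candidate set
    hits = candidates[candidates.intersects(area)].copy()
    # Replace each county geometry with its part inside the area
    hits["geometry"] = hits.geometry.intersection(area)
    return hits


# Toy data (assumed for illustration): three unit-square "counties"
counties = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 0, 6, 1)],
)
# One "business area" straddling counties a and b
area = box(0.5, 0.25, 1.5, 0.75)

result = intersect_with_prefilter(counties, area)
```

With real data you would call this once per firm-year, after loading the two layers with read_file. If a business area consists of several polygons, you could merge them into one geometry first or loop over them.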

andrew
0

I had a similar situation, counting tracks that intersect cells in a large grid; I ended up using multiprocessing to run the overlay function in separate worker processes. Something like this should work:

from multiprocessing import Pool

from geopandas import overlay, read_file


def func(year):
    # Modify these paths to match your own file details
    df1 = read_file(r'Some dir' + 'df1' + str(year) + '.shp')
    df2 = read_file(r'Some dir' + 'df2' + str(year) + '.shp')
    # Write the intersection for this year to its own output
    overlay(df1, df2, how='intersection').to_file('overlay' + str(year))


if __name__ == '__main__':
    years = []  # Fill in: a list of all the years or file prefixes/suffixes
    n = None  # Fill in: number of processors/workers you can call (None lets Pool choose)
    with Pool(n) as pool:
        # imap is lazy, so iterate over it to make sure every task runs
        for _ in pool.imap(func, years):
            pass

I had to make some assumptions, such as each shapefile having the year in its file name. Depending on your PC's resources, you might want to limit the number of workers by assigning a value to n, or you could change Pool(n) to Pool() and let your system sort it out.

MrKingsley