4

I have a GeoDataFrame with a column of shapely.polygons. Some of them distinct, some - not:

In [1]: gdf
Out[2]:
    geometry
1   POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1))
2   POLYGON ((1 3, 1 4, 2 4, 2 3, 1 3))
3   POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1))
4   POLYGON ((3 1, 3 2, 4 2, 4 1, 3 1))
5   POLYGON ((1 3, 1 4, 2 4, 2 3, 1 3))

I need to find only distinct (non-overlapping) polygons:

In [1]: gdf_distinct
Out[2]:
    geometry
1   POLYGON ((1 1, 1 2, 2 2, 2 1, 1 1))
2   POLYGON ((1 3, 1 4, 2 4, 2 3, 1 3))
4   POLYGON ((3 1, 3 2, 4 2, 4 1, 3 1))

As polygons are not hashable, I can not use simple ways in Pandas:

In [1]: gdf_distinct = gdf['geometry'].unique()

TypeError: unhashable type: 'Polygon'

Are there any simple and efficient ways to have a new GeoDataFrame with only distinct polygons?

P.S.:

I found one way, but it works only with fully-duplicate polygons and, as I think, not very efficient:

In [1]: m = []
        for index, row in gdf.iterrows():]
           if row['geometry'] not in m:
              m.append(row['geometry'])
        gdf_distinct = GeoDataFrame(geometry=m)
Headshaker
  • 51
  • 4
  • First of all, I've tried df.polygon_colum.unique() - with no success, as polygons are not hashable. Now I'm working around intersects(), when I iterate through all rows to check if any of polygons in column intersects the specified one,but it seems to be not the best way to do it – Headshaker Nov 24 '17 at 11:40
  • 1
    Please edit our question to include the code of what you have tried. A well-formatted [mcve] is much more likely to attract answers. – MechMK1 Nov 24 '17 at 11:42
  • 1
    Made some editing - hope it clears the situation. – Headshaker Nov 24 '17 at 12:01
  • Do you want the unique polygons, or the non-overlapping polygons? As this is not necessarily the same? – joris Nov 24 '17 at 14:59
  • In my case I want to have non-overlapping polygons (just because my data is possibly dirty) – Headshaker Nov 24 '17 at 17:25
  • If you just want to identify unique polygons, you _could_ do `.unique()` on the WKT representation of the geometry, i.e. serialising the geometry first. – alphabetasoup Dec 23 '21 at 00:00

2 Answers2

5

Let's start with a list of 4 polygons, three of which overlap other polygons:

from shapely.geometry import Polygon
import geopandas

polygons = [
    Polygon([[1, 1], [1, 3], [3, 3], [3, 1], [1, 1]]),
    Polygon([[1, 3], [1, 5], [3, 5], [3, 3], [1, 3]]),
    Polygon([[2, 2], [2, 3.5], [3.5, 3.5], [3.5, 2], [2, 2]]),
    Polygon([[3, 1], [3, 2], [4, 2], [4, 1], [3, 1]]),
]
gdf = geopandas.GeoDataFrame(data={'A': list('ABCD')}, geometry=polygons)
gdf.plot(column='A', alpha=0.75)

They look like this:

enter image description here

So we can loop through each one, then loop through all of the others and check for overlaps with the shapely API. If there are not any overlaps, we'll append it to our output list:

non_overlapping = []
for p in polygons:
    overlaps = []
    for g in filter(lambda g: not g.equals(p), polygons):
        overlaps.append(g.overlaps(p))

    if not any(overlaps):
        non_overlapping.append(p)

Any that gives me:

['POLYGON ((3 1, 3 2, 4 2, 4 1, 3 1))']

Which is what I'd expect.

But this is effectively O(N^2), and I don't think it has to be.

So let's try to never check the same pair twice:

non_overlapping = []
for n, p in enumerate(polygons[:-1], 1):  # don't include the last element
    overlaps = []
    for g in polygons[n:]:  # loop from the next element to the end
        overlaps.append(g.overlaps(p))

    if not any(overlaps):
        non_overlapping.append(str(p))

And I get the same result and it's a smidge faster on my machine.

We can compress the loop a bit by using generator in the if statement instead of a normal for block:

non_overlapping = []
for n, p in enumerate(polygons[:-1], 1):
    if not any(p.overlaps(g) for g in polygons[n:]):
        non_overlapping.append(p)

Same story.

Paul H
  • 65,268
  • 20
  • 159
  • 136
  • Thanks for great explanation and example! – Headshaker Nov 27 '17 at 08:57
  • Shouldn't it be `for g in polygons[n+1:]`, to avoid checking self-overlap? – alphabetasoup Dec 23 '21 at 01:06
  • Shapely `overlaps` returns True if more than one _but not all_ points are in common, so this doesn't produce an incorrect result, but `n+1` will still skip n more unnecessary checks. – alphabetasoup Dec 23 '21 at 01:25
  • @alphabetasoup in my example, using n+1 inserts the same polygon twice into the `non_overlapping` list twice. This is because I'm starting `enumerate` from 1 instead of 0. – Paul H Dec 23 '21 at 13:34
  • Ah right, I missed that. I adapted this to use `iterrows` over a GeoDataFrame, and so n starts from 0 in my case. Then I append the row index to the list for a later `iloc` (I need to apply one process to non-overlapping polygons, and another to overlapping polygons). I'd be interested to see if you have further tips when doing this with a GeoDataFrame, e.g. `itertuples` instead of `iterrows`? – alphabetasoup Dec 23 '21 at 19:35
0

Thanks @Paul H for the great answer and @alphabetasoup for the thoughtful comment.

While my solution doesn't answer the question differently, it is related. My use case involved finding only the overlapping polygons. For this I made a small code modification and found that I needed to include the last element so that I didn't lose one of the overlapping polygons.

# Find polygons in a geopandas dataframe that overlap with another polygon 
# in the same dataframe as well as non-overlapping polygons
overlapping = []
non_overlapping = []
for n, p in enumerate(list(gdf.geometry)[:], 1):  # Included the last element
    overlaps = []
    for g in list(gdf.geometry)[n:]:
        overlaps.append(g.overlaps(p))
    if any(overlaps):
        overlapping.append(p)  
    if not any(overlaps):
        non_overlapping.append(p)  # Did not store as string

My use case also required retaining other columns from the original geopandas geodataframe. Here's how I did that:

overlapping = []
non_overlapping = []
for n, p in enumerate(list(gdf.geometry)[:], 0):  # Used Pythonic zero-based indexing
    if any(p.overlaps(g) for g in list(gdf.geometry)[n:]):
        # Store the index from the original dataframe
        overlapping.append(n)
    if not any(p.overlaps(g) for g in list(gdf.geometry)[n:]):
        non_overlapping.append(n)

# Create a new dataframes and reset their indexes
gdf_overlapping = gdf.iloc[overlapping]  
gdf_overlapping.reset_index(drop=True, inplace=True)
gdf_non_overlapping = gdf.iloc[non_overlapping]
gdf_non_overlapping.reset_index(drop=True, inplace=True)