2

I am trying to address the following issue. Let's assume a dataframe (loaded from a txt file) with the following structure (and thousands of rows):

foo.head()
         X            Y       Z 
 0  125417.5112  536361.8752 -1750.0
 1  127517.7647  533925.8644 -1750.0
 2  128144.1000  533199.4000 -1750.0
 3  128578.8385  532904.9288 -1750.0
 4  125417.5112  536361.8752 -1750.0
 ....

The data represents X Y and Z coordinates.

I also have a set of points that define a closed polygon. These are in a numpy array:

polypoints

array([[ 125417.5112,  536361.8752],
       [ 127517.7647,  533925.8644],
       [ 128144.1   ,  533199.4   ],
       ....
       [ 125417.5112,  536361.8752]])

How can i filter my dataframe to drop the rows that do NOT fall inside the closed polygon?

I have tried defining the polygon using shapely.geometry polygon. By doing:

poly = Polygon(polypoints)

This works fine. But I am at a loss as to how to continue with this.

Help is much appreciated

----EDIT---- Please see below for updated solution

Red Sparrow
  • 387
  • 1
  • 5
  • 17
  • The classic algorithm for this is to draw a line from the point to infinity that doesn't intersect with any of the points and count how many edges it crosses. Odd for inside, even for outside. – Ignacio Vazquez-Abrams Feb 09 '18 at 17:08
  • Have a look at Geopandas: http://geopandas.org/set_operations.html – Rutger Kassies Feb 10 '18 at 06:36
  • @IgnacioVazquez-Abrams that would work for a more "simple" geometric shape i think. In my case it is a complex polygon so it is not as straight forward. Also, I am looking for a solution that works all the time – Red Sparrow Feb 13 '18 at 13:48
  • @RutgerKassies the overlay function looks really promising. Looking into it at the moment. Need to figure out if it can be applied also with polygon and points – Red Sparrow Feb 13 '18 at 13:49
  • It works with any complexity polygon, it simply gets more expensive to calculate as the number of edges goes up. – Ignacio Vazquez-Abrams Feb 13 '18 at 14:35

3 Answers3

3

The original solution suggested by @MrT works great. However, looking at geopandas as suggested by @Rutger Kassies, i have also found another solution. First one needs to install the geopandas package. Then the following code works for me:

import geopandas as gpd
from shapely.geometry import Point, Polygon, MultiPolygon
# load the data that should be cropped by the polygon
# this assumes that the csv file already includes 
# a geometry column with point data as performed below
dat_gpd = gpd.GeoDataFrame.from_csv(r'{}\data_to_crop.csv'.format(savedir), sep='\t')

# load the data of the polygon as a dataframe
arr_df = pd.DataFrame(data, columns=['X','Y','Z'])

# make shapely points out of the X and Y coordinates
point_data = [Point(xy) for xy in zip(arr_df.X, arr_df.Y)]

# assign shapely points as geometry to a geodataframe
# Like this you can also inspect the individual points if needed
arr_gpd = gpd.GeoDataFrame(arr_df, geometry=point_data)

# define a shapely polygon from X and Y coordinates of the shapely points
polygo = Polygon([[p.x, p.y] for p in arr_gpd.geometry])

# assing defined polygon to a new dataframe
pol_gpd= gpd.GeoDataFrame()
pol_gpd['geometry'] = None
pol_gpd.loc[0,'geometry'] = polygo

# define a new dataframe from the spatial join of the dataframe with the data to be cropped
# and the dataframe with the polygon data, using the within function.
dat_fin = gpd.sjoin(dat_gpd, pol_gpd, op = 'within')

Hope this helps if someone faces a similar problem. Also, further info on the spatial join can be found on the geopandas website. Note that this functionality does not require an operation between polygons, but also works with points and polygons

--EDIT --

%timeit gpd.sjoin(dat_gpd, pol_gpd, op = 'within')
31.8 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dat_gpd['inpoly'] = dat_gpd.apply(lambda row: polygo.intersects(Point(row["X"], row["Y"])), axis = 1)
1min 26s ± 389 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It seems that the geo-pandas function is much faster. Though to be fair the non-geo pandas solution also has to convert the X and Y to shapely point elements and then perform the intersection evaluation

Red Sparrow
  • 387
  • 1
  • 5
  • 17
  • Thanks for adding your own solution. I had the feeling that shapely and pandas should work together well, since both are based on numpy. A bit disappointing though, that you still have to calculate each polygon separately. You can time the speed for both approaches with the `timeit` module for a data set of medium size to get an idea, which approach will save you time, when you have a lot of data to stem. +1 – Mr. T Feb 19 '18 at 15:07
  • @MrT I have added the timings. There is a difference though i am not sure i am comparing apples to apples here. Also, the code i provided might not be the best way to define shapely elements in geopandas. It does work but might not be the most elegant. Maybe someone can help to make it more pythonic – Red Sparrow Feb 20 '18 at 13:40
  • It is to be expected that a genuine geopandas code works faster. I am not surprised. For performance enhancement, you could put your code under scrutiny on [CodeReview](https://codereview.stackexchange.com/help/on-topic). – Mr. T Feb 20 '18 at 13:48
2

I am not so familiar with shapely. Maybe they have a genuine pandas support. Afaik, they support vectorised numpy functions, so I wouldn't be surprised.
One way of finding out, which points are within a given polygon, would be to use pandas apply() function:

import pandas as pd
from shapely.geometry import Polygon, Point
#your dataframe of points
df = pd.DataFrame([[0, 0, 0], [1, 2, 3], [2, 2, 2], [3, 2, 1] ], columns = list("XYZ"))
#your polygon points
polygon1_list = [(1, 1), (1, 3), (3, 3), (3, 1)]
#adding a column that contains a boolean variable for each point
df["polygon1"] = df.apply(lambda row: Polygon(polygon1_list).contains(Point(row["X"], row["Y"])), axis = 1)
print(df)

Output for my toy data set

   X  Y  Z  polygon1
0  0  0  0   False
1  1  2  3   False
2  2  2  2    True
3  3  2  1   False

In shapely, contains really means within the polygon, this excludes the border. If you want to include the border, you should use intersects

df["polygon1"] = df.apply(lambda row: Polygon(polygon1_list).intersects(Point(row["X"], row["Y"])), axis = 1)

Now the answer to your question is easy. Just drop the rows that contain False in this new column:

df = df.drop(df[~df["polygon1"]].index)

Unfortunately, you still have to loop over the polygon list. It would be interesting, if somebody knew a way, how to test all points and all polygons without an (explicit) loop. I've seen a MultiPolygon constructor class on their website, so maybe combining all polygons in one class would do the trick. But test in advance that this is a valid selection. A MultiPolygon is invalid, if its members touch at an infinite number of points along a line.

Edit: Seemingly, in Python 2.7 this does not work. See akozi's answer for a 2.7 compatible answer.

Mr. T
  • 11,960
  • 10
  • 32
  • 54
  • This actually does the trick! Thanks for the suggestion! Considering the point of Rutger Kassies I am now looking into the possibility to do this completely with geopandas. Might be a cleaner solution with fewer dependencies – Red Sparrow Feb 13 '18 at 14:03
  • I've seen [this yesterday](http://www.mhermans.net/geojson-shapely-geocoding.html) and thought of this thread. Maybe of help for you. Good luck with your project. – Mr. T Feb 13 '18 at 14:08
  • Thanks again, will check it out – Red Sparrow Feb 13 '18 at 14:29
  • For anyone that is trying this in `Python 2.7` you need to use `contains_points` instead of `contains`. It also seems to have trouble with single items. I included a working example edited from this answer in an answer of my own. – akozi Oct 31 '18 at 11:41
1

I had trouble mimicking the exact solution Mr T suggested in Python 2.7. So here is the minor difference I had to make to have it work in Python 2.7.

from shaply.geometry.polygon import Polygon
inside = Polygon(poly_points).contains_points(zip(df.X.values, df.Y.values))
df['inside'] = inside
df = df.drop(df[~df['inside']].index)

It seems that the old version of contains_points had trouble running with a single point. So I set it up to read all of the points and append that list as a new column.

akozi
  • 425
  • 1
  • 7
  • 20