I have a DataFrame of 30 million longitude/latitude points in NYC. I want to map each coordinate to a census tract, and preferably have these census tracts as another column in the DataFrame.
Currently I am doing this with Shapely and PySpark's map function, and it is slow: it takes roughly 0.2 seconds to map each coordinate to a census tract. I want to see whether I can do this faster, for example with Magellan.
import shapefile
from shapely.geometry import Polygon, Point
from time import time

Census_Tracts_Shapefile_Path = 'Data/NYC Census Tracts/NYC/nyct2010wi.shp'
CensusTract_Shapefile = shapefile.Reader(Census_Tracts_Shapefile_Path)
CensusTract_Shapes = CensusTract_Shapefile.shapes()
Polygons = [shape.points for shape in CensusTract_Shapes]  # raw coordinate lists, one per tract
def Census_Tract_Finder(x, y, Polygons):
    # Return the indices of all polygons in Polygons that contain the point (x, y).
    try:
        x = float(x); y = float(y)
    except ValueError:
        return []
    point = Point(x, y)
    Tract = []
    for counter in range(len(Polygons)):
        Poly = Polygon(Polygons[counter])
        if Poly.contains(point):
            Tract.append(counter)
    return Tract
# In this section, I filter the census tracts
# to find the ones that are in Manhattan.
Manhattan_CT = []
CT_Records = CensusTract_Shapefile.shapeRecords()
for counter in range(len(CT_Records)):
    if int(CT_Records[counter].record[1]) == 1:  # borough code 1 corresponds to Manhattan
        Manhattan_CT.append(counter)
CT_Records_Manhattan = [CT_Records[index] for index in Manhattan_CT]
Polygons_Manhattan = [Polygons[index] for index in Manhattan_CT]
# An example of how I look up the census tract of a single point:
# print Census_Tract_Finder('-73.986191', '40.760681', Polygons_Manhattan)

Start = time(); N = 1000  # For testing purposes, I focus on the first N rows.
dd = df_Spark  # This is the Spark DataFrame that I have already loaded.
Output = dd.rdd \
    .map(lambda x: Census_Tract_Finder(x['longitude'], x['latitude'], Polygons_Manhattan)) \
    .take(N)
Duration = time() - Start
print 'Output Calculations:', Duration
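
For reference, this is the kind of pure-Python speedup I have been considering before switching to Magellan: build the Shapely polygons once and put them in an STRtree spatial index, so each point is only tested against nearby tracts instead of being checked against every Manhattan polygon. This is only a sketch under my assumptions (Shapely 1.x, where STRtree.query returns geometry objects, and the Polygons_Manhattan list from above); I have not benchmarked it:

from shapely.geometry import Polygon, Point
from shapely.strtree import STRtree

# Build each Polygon once, instead of rebuilding it for every point.
Manhattan_Polygons = [Polygon(coords) for coords in Polygons_Manhattan]

# Shapely 1.x STRtree.query returns geometries, so map id() back to the list index.
Index_By_Id = {id(poly): i for i, poly in enumerate(Manhattan_Polygons)}
Tree = STRtree(Manhattan_Polygons)

def Census_Tract_Finder_Indexed(x, y):
    # Return the indices of the Manhattan tracts that contain the point (x, y).
    try:
        point = Point(float(x), float(y))
    except ValueError:
        return []
    # query() only filters by bounding box, so confirm with contains().
    return [Index_By_Id[id(poly)] for poly in Tree.query(point) if poly.contains(point)]

# Example: print Census_Tract_Finder_Indexed('-73.986191', '40.760681')

I am not sure how well the tree serializes to the executors, so it may still need to be built inside a mapPartitions call or shipped with sc.broadcast; that is part of what I am unsure about.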