I have a DataFrame of 30 million longitude/latitude points in NYC. I want to map each coordinate to a census tract, and preferably have these census tracts as another column in the DataFrame.
Currently I am doing this with Shapely and PySpark's map function, and it is slow: it takes roughly 0.2 seconds to map each coordinate to a census tract. I want to see whether I can do this faster, for example with Magellan.
import shapefile
from shapely.geometry import Polygon, Point
from time import time

Census_Tracts_Shapefile_Path = 'Data/NYC Census Tracts/NYC/nyct2010wi.shp'
CensusTract_Shapefile = shapefile.Reader(Census_Tracts_Shapefile_Path)
CensusTract_Shapes = CensusTract_Shapefile.shapes()
Polygons = [shape.points for shape in CensusTract_Shapes]  # raw coordinate lists, one per tract
def Census_Tract_Finder(x, y, Polygons):
    # Return the indices of all polygons in Polygons that contain the point (x, y).
    try:
        x = float(x); y = float(y)
    except ValueError:
        return []
    point = Point(x, y)
    Tract = []
    for counter in range(len(Polygons)):
        Poly = Polygon(Polygons[counter])
        if Poly.contains(point):
            Tract.append(counter)
    return Tract
# In this section, I filter the census tracts
# to find the ones that are in Manhattan.
Manhattan_CT = []
CT_Records = CensusTract_Shapefile.shapeRecords()
for counter in range(len(CT_Records)):
    if int(CT_Records[counter].record[1]) == 1:  # borough code 1 corresponds to Manhattan
        Manhattan_CT.append(counter)
CT_Records_Manhattan = [CT_Records[index] for index in Manhattan_CT]
Polygons_Manhattan = [Polygons[index] for index in Manhattan_CT]
# An example of how I look up the census tract of a single point:
# print Census_Tract_Finder('-73.986191', '40.760681', Polygons_Manhattan)

Start = time(); N = 1000  # For testing purposes, I focus on the first N rows.
dd = df_Spark  # This is the Spark DataFrame that I have already loaded.
Output = dd.rdd \
    .map(lambda x: Census_Tract_Finder(x['longitude'], x['latitude'], Polygons_Manhattan)) \
    .take(N)
Duration = time() - Start
print 'Output Calculations:', Duration
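
For reference, this is the kind of pure-Python speedup I have been considering before switching to Magellan: build the Shapely polygons once and put them in an STRtree spatial index, so each point is only tested against nearby tracts instead of being checked against every Manhattan polygon. This is only a sketch under my assumptions (Shapely 1.x, where STRtree.query returns geometry objects, and the Polygons_Manhattan list from above); I have not benchmarked it:

from shapely.geometry import Polygon, Point
from shapely.strtree import STRtree

# Build each Polygon once, instead of rebuilding it for every point.
Manhattan_Polygons = [Polygon(coords) for coords in Polygons_Manhattan]

# Shapely 1.x STRtree.query returns geometries, so map id() back to the list index.
Index_By_Id = {id(poly): i for i, poly in enumerate(Manhattan_Polygons)}
Tree = STRtree(Manhattan_Polygons)

def Census_Tract_Finder_Indexed(x, y):
    # Return the indices of the Manhattan tracts that contain the point (x, y).
    try:
        point = Point(float(x), float(y))
    except ValueError:
        return []
    # query() only filters by bounding box, so confirm with contains().
    return [Index_By_Id[id(poly)] for poly in Tree.query(point) if poly.contains(point)]

# Example: print Census_Tract_Finder_Indexed('-73.986191', '40.760681')

I am not sure how well the tree serializes to the executors, so it may still need to be built inside a mapPartitions call or shipped with sc.broadcast; that is part of what I am unsure about.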