Question: Is there a straightforward way to split a pandas DataFrame into multiple groups based on index, apply a predefined function to compute a new feature, and combine the groups back into the original shape with the added feature?
I'm trying to enrich the features in the NYC Green Taxi data, as shown below.
Loading the data and defining the processing function
import pandas as pd
from geopy.geocoders import Nominatim
import os
import urllib.request

# Download the data if it is not already present
if not os.path.exists('./gc_data_sept15.csv'):
    urllib.request.urlretrieve('https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv', './gc_data_sept15.csv')

# Load the data
greentaxi_data = pd.read_csv('./gc_data_sept15.csv')

# Get the county for a (latitude, longitude) pair using geopy's reverse geocoder
def getCounty(coords, geolocator):
    location = geolocator.reverse(coords)
    # reverse() can return None, and 'county' may be missing from the address
    county = location.raw.get('address', {}).get('county') if location is not None else None
    if county:
        return county.split()[0]
    else:
        return 'Unknown'
Checking functionality over a subset
# Define the geolocator and derive Pickup_borough
subset_ = greentaxi_data[:100].copy()  # copy so the assignment below does not hit a view of the original
geolocator = Nominatim(timeout=5000)
subset_['Pickup_borough'] = subset_.apply(
    lambda row: getCounty((row['Pickup_latitude'], row['Pickup_longitude']), geolocator),
    axis=1)
This works:
>>> subset_['Pickup_borough'].head(5)
0 Kings
1 Bergen
2 Queens
3 Queens
4 Kings
But when I try to do this on the original dataframe, I keep getting 'Service timed out' after a certain index, in spite of repeated attempts with a higher timeout.
So here's my question: is there a way to split the dataframe into subsets of, say, 1000 rows each, apply the function above to each subset, and combine the results back into the original dataframe shape?
Original data size:
>>> greentaxi_data.shape
(1494926, 21)
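To make the idea concrete, here is a minimal sketch of the split/apply/combine step I have in mind. The helper name `apply_in_chunks`, the toy dataframe, and the stand-in function `fake_lookup` are hypothetical and only there for illustration; in the real case the function passed in would call `getCounty` with the geolocator. It assumes a non-empty dataframe and does not handle retries or timeouts by itself.

```python
import pandas as pd

def apply_in_chunks(df, func, new_col, chunk_size=1000):
    """Apply a row-wise function to consecutive slices of df and stitch them back together."""
    pieces = []
    for start in range(0, len(df), chunk_size):
        # Positional slice of at most chunk_size rows; copy so we can add a column safely
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk[new_col] = chunk.apply(func, axis=1)
        pieces.append(chunk)
    # Concatenating in order preserves the original index and row order
    return pd.concat(pieces)

# Toy data standing in for greentaxi_data (hypothetical values)
toy = pd.DataFrame({
    'Pickup_latitude': [40.60 + 0.0001 * i for i in range(2500)],
    'Pickup_longitude': [-73.95] * 2500,
})

# Stand-in for the geocoding call, so the sketch runs without network access
def fake_lookup(row):
    return 'North' if row['Pickup_latitude'] > 40.7 else 'South'

result = apply_in_chunks(toy, fake_lookup, 'Pickup_borough', chunk_size=1000)
```

The result has the same rows and index as the input plus the new column. Because each chunk is processed separately, a failed chunk could in principle be retried or saved to disk without redoing the earlier ones, though that part is not shown here.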