I am trying to use GPU-accelerated computing for distance calculations, but I have had a lot of trouble with the differences between pandas and cudf.
I have a DataFrame of vehicles and points in time (lat, lon, timestamp). My CPU-based calculation was roughly something like this:
df = pd.read_csv('taxis.csv')
results = df.groupby('vehicle').apply(lambda x: get_distance(x))
where get_distance basically calculates the distance between successive points, using the .shift() operator on the lat and lon columns to align them.
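For reference, a minimal sketch of what such a get_distance could look like on the CPU. The haversine_km helper, the Earth radius constant, the sort by timestamp, and the sample data are assumptions made for illustration, not the exact code from my project.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def get_distance(group):
    """Total distance travelled by one vehicle, in km."""
    group = group.sort_values("timestamp")
    prev = group[["lat", "lon"]].shift()  # previous point; first row is NaN
    steps = haversine_km(prev["lat"], prev["lon"], group["lat"], group["lon"])
    return steps.sum()  # sum() skips the NaN produced by the first row


# Hypothetical sample data: vehicle "a" moves along the equator, "b" stays put.
df = pd.DataFrame({
    "vehicle": ["a", "a", "a", "b", "b"],
    "timestamp": [1, 2, 3, 1, 2],
    "lat": [0.0, 0.0, 0.0, 10.0, 10.0],
    "lon": [0.0, 1.0, 2.0, 5.0, 5.0],
})
results = df.groupby("vehicle")[["timestamp", "lat", "lon"]].apply(get_distance)
```

Here results is a Series indexed by vehicle, with one total distance per vehicle.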
Trying to use cudf and cuspatial from RAPIDS AI has proven to be very confusing.
This is what I am trying to do:
df = cudf.read_csv("vehicles.csv")
grouped_df = df.groupby("vehicle", method="cudf")
results = grouped_df.apply_grouped(gpu_distance,
                                   incols=['lat', 'lon'],
                                   outcols=dict(tot=np.float64))
where my gpu_distance function (the part that's not working) is
def gpu_distance(lat, lon, tot):
    lat1 = lat[1:]
    lon1 = lon[1:]
    lat2 = lat[0:-1]
    lon2 = lon[0:-1]
    distances = cuspatial.haversine_distance(lat1, lon1, lat2, lon2)
    tot = np.sum(distances)
This is not my full use case yet, but I am struggling to build it out. I get an error about the module itself:
Unknown attribute 'haversine_distance' of type Module(<module 'cuspatial' from {my RAPIDS installation}
Any ideas about what's going wrong, or pointers to better documentation on this, would be appreciated.
I am able to run the cuspatial.haversine_distance function when not running from within groupby statements; the following code executes normally:
lat1 = df.shift()['lat'][1:]
lon1 = df.shift()['lon'][1:]
lat2 = df['lat'][1:]
lon2 = df['lon'][1:]
res = cuspatial.haversine_distance(lat1,lon1,lat2,lon2)
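Note that a whole-frame shift like the one above pairs the last point of one vehicle with the first point of the next. One possible direction is to do the shift per group and keep the distance call outside any grouped kernel. The sketch below is written in pandas so that it runs on a CPU. It assumes cudf offers an equivalent groupby(...).shift(), and that cuspatial.haversine_distance (whose exact signature may depend on the installed version) could stand in for the hand-written haversine_km helper. The helper, the Earth radius, and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical sample data, ordered by vehicle and then by time.
df = pd.DataFrame({
    "vehicle": ["a", "a", "a", "b", "b"],
    "timestamp": [1, 2, 3, 1, 2],
    "lat": [0.0, 0.0, 0.0, 10.0, 10.0],
    "lon": [0.0, 1.0, 2.0, 5.0, 5.0],
}).sort_values(["vehicle", "timestamp"])

# Shift within each vehicle, so the first row of every vehicle gets NaN
# instead of the last point of the previous vehicle.
prev = df.groupby("vehicle")[["lat", "lon"]].shift()

# One vectorised distance call over the whole frame, outside any groupby kernel.
df["step_km"] = haversine_km(prev["lat"], prev["lon"], df["lat"], df["lon"])

# Per-vehicle totals; sum() skips the NaN on each vehicle's first row.
totals = df.groupby("vehicle")["step_km"].sum()
```

The design choice here is to avoid library calls inside the function passed to apply_grouped. The error is consistent with that function being compiled as a GPU kernel that cannot resolve calls into cuspatial, though this is an assumption rather than something confirmed here.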