
I am trying to use GPU-accelerated computing for distance calculations, but I have had a lot of trouble with the differences between pandas and cudf.

I have a df with vehicles and points in time (lat, lon, timestamp). My CPU-based calculation was roughly this:

df = pd.read_csv('taxis.csv')

results = df.groupby('vehicle').apply(lambda x: get_distance(x))

where get_distance calculates the distance between successive points, using .shift() on the lat and lon columns to align each point with the next one.
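For reference, the per-vehicle computation described above can be sketched in plain Python. This is not the original get_distance: the names haversine_km and total_distance_by_vehicle, the (vehicle, lat, lon, timestamp) row layout, and the 6371 km mean Earth radius are assumptions made for illustration.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def total_distance_by_vehicle(rows):
    """rows: iterable of (vehicle, lat, lon, timestamp) tuples (hypothetical layout)."""
    tracks = defaultdict(list)
    for vehicle, lat, lon, ts in rows:
        tracks[vehicle].append((ts, lat, lon))

    totals = {}
    for vehicle, points in tracks.items():
        points.sort()  # order each track by timestamp
        # Pair every point with its successor, like shift() does in pandas.
        totals[vehicle] = sum(
            haversine_km(a[1], a[2], b[1], b[2])
            for a, b in zip(points, points[1:])
        )
    return totals


rows = [
    ("a", 0.0, 0.0, 1),
    ("a", 0.0, 1.0, 2),
    ("a", 0.0, 2.0, 3),
    ("b", 10.0, 10.0, 1),
]
totals = total_distance_by_vehicle(rows)
```

A vehicle with a single point gets a total of 0, because there is no successor to pair it with.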

Doing the same with cudf and cuspatial from RAPIDS has proven very confusing.

This is what I am trying:

df = cudf.read_csv("vehicles.csv")
grouped_df = df.groupby("vehicle", method="cudf")
results = grouped_df.apply_grouped(gpu_distance,
                               incols=['lat','lon'],
                               outcols=dict(tot=np.float64))

where my gpu_distance function (the part that is not working) is

def gpu_distance(lat, lon, tot):
    # Align successive points: element i of (lat1, lon1) follows element i of (lat2, lon2)
    lat1 = lat[1:]
    lon1 = lon[1:]
    lat2 = lat[0:-1]
    lon2 = lon[0:-1]

    distances = cuspatial.haversine_distance(lat1, lon1, lat2, lon2)

    tot = np.sum(distances)

This is not my full use case yet, but I am struggling to build it out. I get an error about the module itself:

  Unknown attribute 'haversine_distance' of type Module(<module 'cuspatial' from {my RAPIDS installation}

Any ideas about what's going wrong, or pointers to better documentation, would be appreciated.

I am able to run cuspatial.haversine_distance outside of a groupby; the following code executes normally:

lat1 = df.shift()['lat'][1:]
lon1 = df.shift()['lon'][1:]
lat2 = df['lat'][1:]
lon2 = df['lon'][1:]

res = cuspatial.haversine_distance(lat1,lon1,lat2,lon2)
Thomson Comer

1 Answer


Here's a quick implementation that I think captures what you're looking for:

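import numpy as np
from numba import cuda

def shifter(lon, lat, shift_lon, shift_lat):
    # Each thread strides over the group and copies point i+1 into slot i.
    for i in range(cuda.threadIdx.x, len(lon)-1, cuda.blockDim.x):
        shift_lon[i] = lon[i+1]
        shift_lat[i] = lat[i+1]
    # The last slot wraps around to the first point of the group.
    shift_lon[len(lon)-1] = lon[0]
    shift_lat[len(lat)-1] = lat[0]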

sorted_coords = grouped_df.apply_grouped(shifter,
                                        incols=['lon', 'lat'],
                                        outcols={'shift_lon': np.float64,
                                                 'shift_lat': np.float64},
                                        tpb=8)

# Called outside apply_grouped, so cuspatial is available here.
df['distances'] = cudf.Series(
    cuspatial.haversine_distance(sorted_coords['shift_lon'],
                                 sorted_coords['shift_lat'],
                                 sorted_coords['lon'],
                                 sorted_coords['lat']))
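To see which pairs of points the shifter produces, here is a CPU-only sketch of the same indexing in plain Python. The helper name shift_with_wraparound is hypothetical and only mirrors the kernel's logic for a single group; it is not part of cudf or cuspatial.

```python
def shift_with_wraparound(values):
    """Return values shifted left by one, with the first element wrapped to the end."""
    n = len(values)
    if n == 0:
        return []
    shifted = [0.0] * n
    for i in range(n - 1):
        shifted[i] = values[i + 1]  # slot i holds the next point
    shifted[n - 1] = values[0]      # last slot wraps to the group's first point
    return shifted


lon = [10.0, 11.0, 12.0]
pairs = list(zip(lon, shift_with_wraparound(lon)))
```

Because of the wraparound, the last row of each group is paired with the group's first point, so its distance is the segment from the last point back to the first. If you want the length of an open track rather than a closed loop, that row probably needs to be dropped or zeroed before summing per vehicle.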
Thomson Comer
  • I installed cuspatial from the RAPIDS home page with `conda install -c rapidsai -c nvidia -c conda-forge -c defaults rapids=0.15 python=3.7 cudatoolkit=10.2`. I'm able to import it and it works, since I am also able to compute `haversine_distance` on a cudf DataFrame outside of groupby statements. – BernieFeynman Sep 22 '20 at 21:53
  • Ok, that helps! I'll update my response. I think what you're seeing is that a function passed to cudf's `apply_grouped` doesn't actually have access to `cuspatial`'s symbols. There should be an easier way, which I'll suggest above. – Thomson Comer Sep 22 '20 at 22:21
  • @BernieFeynman I've provided what I think is a working solution. Let me know if this works for you! – Thomson Comer Sep 30 '20 at 17:11