I'm trying out RAPIDS cudf and cuspatial, and I'm wondering what the better ways are to cross join two dataframes when the result is about 27 billion rows.
I've got two datasets: one is New York City taxi trip data (14.7 million rows) containing the longitude/latitude of pickup locations, and the other contains the longitude/latitude of 1.8k metro stations. For each trip I want to cross join against all station locations, then calculate the Haversine distance for every trip-station pair.
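For scale, here's my rough back-of-the-envelope estimate of the output size (assuming four float64 coordinate columns; the pair count is where my ~27 billion comes from):

n_trips = 14_700_000
n_stations = 1_800
n_pairs = n_trips * n_stations        # ~26.5 billion rows
bytes_per_row = 4 * 8                 # four float64 coordinate columns
print(n_pairs * bytes_per_row / 1e9)  # ~847 GB, before even adding the distance column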
I don't think cudf allows cross joins, so I created a new key column in both datasets and merged on it:
cutaxi['key'] = 0
cumetro['key'] = 0
cutaxi_metro = cutaxi.merge(cumetro, on='key', how='outer')
cutaxi_metro['hdist_km'] = cuspatial.haversine_distance(
    cutaxi_metro['EntranceLongitude'], cutaxi_metro['EntranceLatitude'],
    cutaxi_metro['taxi_pickup'], cutaxi_metro['taxi_dropoff'])
I was running the code on an NVIDIA V100 with 4 virtual CPUs, but I still ran into out-of-memory errors, which makes sense in hindsight since the merged frame is far larger than the GPU's memory. I'm guessing that I need to process the merge in batches, but I'm not sure how to approach it. Any suggestions are appreciated!
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
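To make the question concrete, this is the kind of chunked loop I have in mind (column names as in my merge above; chunk_size and the trip_id helper column are just placeholders, and I reduce each chunk to the nearest station so the full ~27 billion rows never exist at once):

import cupy
import cudf
import cuspatial

cutaxi['trip_id'] = cupy.arange(len(cutaxi))  # placeholder id so pairs can be grouped back per trip
cumetro['key'] = 0

chunk_size = 1_000_000  # placeholder; tune so chunk_size * 1,800 rows fits in GPU memory
nearest_parts = []

for start in range(0, len(cutaxi), chunk_size):
    chunk = cutaxi.iloc[start:start + chunk_size].copy()
    chunk['key'] = 0
    pairs = chunk.merge(cumetro, on='key', how='outer')
    pairs['hdist_km'] = cuspatial.haversine_distance(
        pairs['EntranceLongitude'], pairs['EntranceLatitude'],
        pairs['taxi_pickup'], pairs['taxi_dropoff'])
    # keep only the minimum distance per trip so each chunk collapses back to chunk_size rows
    nearest_parts.append(pairs.groupby('trip_id')['hdist_km'].min())

nearest = cudf.concat(nearest_parts)

Is a loop like this a reasonable approach, or is there a more idiomatic way to batch this with cudf/cuspatial?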