I'm trying out RAPIDS cudf and cuspatial, and I'm wondering what the better ways are to cross join two dataframes when the result is about 27 billion rows.
I've got two datasets: one is New York City taxi trip data (14.7 million rows) containing the longitude/latitude of pickup locations, and the other contains the longitude/latitude of 1.8k metro stations. For each trip I want to cross join against all station locations, then calculate the Haversine distance for every trip-station pair.
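For scale, here's my rough back-of-the-envelope estimate of the output size (assuming four float64 coordinate columns; the pair count is where my ~27 billion comes from):

n_trips = 14_700_000
n_stations = 1_800
n_pairs = n_trips * n_stations        # ~26.5 billion rows
bytes_per_row = 4 * 8                 # four float64 coordinate columns
print(n_pairs * bytes_per_row / 1e9)  # ~847 GB, before even adding the distance column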
I don't think cudf allows cross joins, so I created a new key column in both datasets and merged on it:
cutaxi['key'] = 0
cumetro['key'] = 0
cutaxi_metro = cutaxi.merge(cumetro, on='key', how='outer')
cutaxi_metro['hdist_km'] = cuspatial.haversine_distance(
    cutaxi_metro['EntranceLongitude'], cutaxi_metro['EntranceLatitude'],
    cutaxi_metro['taxi_pickup'], cutaxi_metro['taxi_dropoff'])
I was running the code on an NVIDIA V100 with 4 virtual CPUs, but I still ran into out-of-memory errors, which makes sense in hindsight since the merged frame is far larger than the GPU's memory. I'm guessing that I need to process the merge in batches, but I'm not sure how to approach it. Any suggestions are appreciated!
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
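To make the question concrete, this is the kind of chunked loop I have in mind (column names as in my merge above; chunk_size and the trip_id helper column are just placeholders, and I reduce each chunk to the nearest station so the full ~27 billion rows never exist at once):

import cupy
import cudf
import cuspatial

cutaxi['trip_id'] = cupy.arange(len(cutaxi))  # placeholder id so pairs can be grouped back per trip
cumetro['key'] = 0

chunk_size = 1_000_000  # placeholder; tune so chunk_size * 1,800 rows fits in GPU memory
nearest_parts = []

for start in range(0, len(cutaxi), chunk_size):
    chunk = cutaxi.iloc[start:start + chunk_size].copy()
    chunk['key'] = 0
    pairs = chunk.merge(cumetro, on='key', how='outer')
    pairs['hdist_km'] = cuspatial.haversine_distance(
        pairs['EntranceLongitude'], pairs['EntranceLatitude'],
        pairs['taxi_pickup'], pairs['taxi_dropoff'])
    # keep only the minimum distance per trip so each chunk collapses back to chunk_size rows
    nearest_parts.append(pairs.groupby('trip_id')['hdist_km'].min())

nearest = cudf.concat(nearest_parts)

Is a loop like this a reasonable approach, or is there a more idiomatic way to batch this with cudf/cuspatial?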