I'm trying out RAPIDS cuDF and cuSpatial. What are better ways to cross join two dataframes when the result has about 27 billion rows?

I've got two datasets: one from New York City taxi trip data (14.7 million rows) containing the longitude/latitude of pickup locations, and another containing the longitude/latitude of 1.8k metro stations. For each trip I want to cross join with every station location, then calculate the Haversine distance for every trip-station pair.

I don't think cuDF supports cross joins, so I created a new column, key, with the same constant value in both datasets and merged on it.

import cuspatial

# A constant key in both frames makes the merge pair every trip with every station
cutaxi['key'] = 0
cumetro['key'] = 0
cutaxi_metro = cutaxi.merge(cumetro, on='key', how='outer')
cutaxi_metro['hdist_km'] = cuspatial.haversine_distance(cutaxi_metro['EntranceLongitude'], cutaxi_metro['EntranceLatitude'],
                                                        cutaxi_metro['taxi_pickup'], cutaxi_metro['taxi_dropoff'])

I was running the code on an Nvidia V100 with 4 virtual CPUs, but I still ran out of memory. I'm guessing that I need to process the merge in batches, but I'm not sure how to approach it. Any suggestions are appreciated!

MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
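
The batching idea mentioned above can be sketched roughly as follows. This is a plain-Python, CPU-only stand-in rather than cuDF code: the helper names (haversine_km, iter_batches, nearest_station_km) are hypothetical, and reducing each batch to the nearest-station distance is an assumption about what is ultimately needed. The point is only that each batch of trips is crossed with all stations and reduced before the next batch starts, so the full 27-billion-row result never exists at once. With cuDF, the same loop would presumably slice the taxi frame per batch and merge only that slice with the station frame.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def iter_batches(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def nearest_station_km(trips, stations, batch_size):
    """trips and stations are lists of (lon, lat) tuples.

    Crosses one batch of trips with all stations at a time and keeps only
    the minimum distance per trip (an assumed reduction, for illustration).
    """
    nearest = []
    n_stations = len(stations)
    for start, stop in iter_batches(len(trips), batch_size):
        batch = trips[start:stop]
        # Cross product for this batch only: len(batch) * n_stations distances
        dists = [haversine_km(t[0], t[1], s[0], s[1]) for t in batch for s in stations]
        # Reduce before moving on, so the batch's distances can be freed
        for i in range(len(batch)):
            nearest.append(min(dists[i * n_stations:(i + 1) * n_stations]))
    return nearest
```

The batch size controls peak memory: each batch holds batch_size times the number of stations distances, regardless of how many trips there are in total.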
byc
  • Cross joining a 14.7M row dataset with an 18K dataset will give you 264.6 billion rows. You'll need to use cuDF + Dask, as this output would be far too large to store in memory on a single GPU. It's at least a terabyte if every value is an int32 (four bytes). You'd likely want multiple GPUs to process this, if you want to do the cross join. https://docs.rapids.ai/api/cudf/stable/10min.html – Nick Becker Dec 27 '20 at 16:16
  • Thanks for the comment. I noticed I made a typo in my question: it should be 1.8k metro stations, so the cross join will return about 27 billion rows. Is there a place where I can find out how much memory and how many GPUs I need? – byc Dec 28 '20 at 06:05
  • Were you able to find a good guide on how much memory to allocate for operations like a cross join? I have a similar problem when running Spark RAPIDS on EMR. – c74ckds Mar 09 '21 at 20:15
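
The arithmetic in the comments can be reproduced with a back-of-the-envelope estimate. The sketch below is only a lower bound under stated assumptions: the function names (cross_join_bytes, max_batch_rows) are hypothetical, the column counts and memory budget are illustrative, and it counts raw column values only, ignoring any working memory the join itself needs.

```python
def cross_join_bytes(n_left, n_right, n_columns, bytes_per_value):
    """Return (rows, bytes) for a materialized cross join of fixed-width columns."""
    rows = n_left * n_right
    return rows, rows * n_columns * bytes_per_value


def max_batch_rows(budget_bytes, n_right, n_columns, bytes_per_value):
    """Largest number of left-side rows whose cross join fits in budget_bytes."""
    return budget_bytes // (n_right * n_columns * bytes_per_value)


# 14.7M trips x 1.8k stations, counting a single float64 column (8 bytes)
rows, size = cross_join_bytes(14_700_000, 1_800, n_columns=1, bytes_per_value=8)
print(rows, size / 1e9)  # 26460000000 rows, 211.68 GB for one column

# Assumed for illustration: an 8 GB budget and 6 float64 columns per output row
print(max_batch_rows(8_000_000_000, 1_800, n_columns=6, bytes_per_value=8))  # 92592
```

Even one float64 column of the full result is roughly 212 GB, well beyond a single V100 (sold with 16 GB or 32 GB of memory), which is consistent with the out-of-memory error in the question.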
