
I am new to machine learning and GPU computing, which is why I was excited about RAPIDS and Dask.

I am running on an AWS EC2 p3.8xlarge instance. On it I am running Docker with the RAPIDS container, version 0.16. The instance has a 60GB EBS volume.

I have a dataset with about 80 million records. As CSV it is about 27GB, and as Parquet (with slightly fewer features) it is 3.4GB (in both cases stored on AWS S3).

When I try to use dask_cudf with a LocalCUDACluster, I always run into an issue with crashing workers. Core dumps are created and execution continues, spawning new workers and eventually filling all the storage on my machine.

See below some example executions showing memory going up, not respecting rmm_pool_size, and eventually crashing. I tried many values for rmm_pool_size, both over and under the total GPU memory (from what I understand, it should be able to spill to host memory).

I am using the following initial code:

from dask_cuda import LocalCUDACluster
from distributed import Client, LocalCluster
import dask_cudf


cluster = LocalCUDACluster(
    rmm_pool_size="60GB"  # I've tried 64, 100, 150 etc. No luck
)
# I also tried setting rmm_managed_memory...
# I know there are other parameters (UCX, etc.) but don't know whether they are relevant or how to use them

client = Client(cluster)

df = dask_cudf.read_parquet("s3://my-bucket/my-parquet-dir/")

I print memory usage:

mem = df.memory_usage().compute()
print(f"total dataset memory: {mem.sum() / 1024**3}GB")

Resulting in

total dataset memory: 50.736539436504245GB

Then, when executing my code (whether doing some EDA, running KNN, or pretty much anything else), I get this behavior / error.

I read the docs, I read numerous blogs (mainly from RAPIDS), and I ran through the notebooks, but I am still not able to get it to work. Am I doing something wrong? Will this not work with my setup?

Any help would be appreciated...

Example execution - knn

Example execution - persist

Tomer Cagan
1 Answer


RMM limits are set per GPU. So if your goal is 60GB total, set RMM to 15GB. (I just realized you're only using 4 GPUs.)
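To make the per-GPU arithmetic concrete, here is a small sketch (the helper function and the 4-GPU / 16GB-per-GPU figures for a p3.8xlarge are my own illustration, not from the dask-cuda docs): divide the cluster-wide target by the GPU count before passing it to `rmm_pool_size`.

```python
def per_gpu_pool_gb(total_target_gb: float, n_gpus: int) -> float:
    """Split a cluster-wide RMM pool target evenly across GPUs.

    LocalCUDACluster starts one worker process per GPU, and
    rmm_pool_size applies to each worker, not to the whole cluster.
    """
    return total_target_gb / n_gpus


# A p3.8xlarge has 4 V100 GPUs with 16GB each, so a 60GB total target
# works out to 15GB per GPU -- right at the device limit, so in
# practice you would leave some headroom below that.
pool_gb = per_gpu_pool_gb(60, 4)
print(pool_gb)  # 15.0

# e.g. cluster = LocalCUDACluster(rmm_pool_size=f"{pool_gb:.0f}GB")
```

Passing `"60GB"` directly, as in the question, asks each of the four workers for a 60GB pool on a 16GB device, which would explain the crashing workers.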

JoshP
  • I was following an example like https://medium.com/rapids-ai/reading-larger-than-memory-csvs-with-rapids-and-dask-e6e27dfa6c0f which seems to create a memory pool larger than overall device memory... are you sure it's per device? In the first example there, the `device_memory_limit` parameter is set to 40GB, which I assumed was larger than device memory, but now I see that the GPU has 48GB of memory... – Tomer Cagan Nov 01 '20 at 08:24
  • Positive it's per device. All of RAPIDS follows a one process per GPU paradigm. – JoshP Nov 02 '20 at 14:34
  • 1
    @TomerCagan I've updated those notebook gists to explicitly note that each GPU has 48GB of memory. Hopefully that helps make it more clear! – Nick Becker Nov 03 '20 at 16:06