
I asked a somewhat similar question the other day. I have tried asking on the Dask Slack and on their Discourse forum, but to no avail.

I am currently trying to create a large array in CPU memory, move chunks of it to the GPU to perform a multiplication, and then move the results back to the CPU. I keep getting a memory error, even for arrays of shape (512, 512, 1000).

I have searched the web, and some sources pointed out that the problem could be the memory allocator, which can be configured to use managed memory. However, I still get the memory error.

import cupy as cp
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import cudf 

if __name__ == '__main__':
    
    cluster = LocalCUDACluster('0', n_workers=1)
    client = Client(cluster)    
    client.run(cudf.set_allocator, "managed")


    shape = (512, 512, 1000)
    chunks = (100, 100, 1000)

    huge_array_gpu = da.ones_like(cp.array(()), shape=shape, chunks=chunks)
    array_sum = da.multiply(huge_array_gpu, 17).compute()
   

Am I overlooking something?

JOKKINATOR
  • The above works if I change the da.multiply to da.sum. No real idea why. – JOKKINATOR Oct 17 '22 at 10:12
  • da.multiply and da.sum most certainly have different memory footprints, so it's possible that you're not ever really going out of memory in the latter case. – pentschev Oct 17 '22 at 15:37

1 Answer


You are setting the cuDF allocator but only using CuPy to compute. Each library has its own allocator, which you need to set accordingly. The proper way to achieve what you are trying to do requires a few modifications: enabling unified memory directly in LocalCUDACluster, and then setting CuPy's allocator to use RMM (the RAPIDS Memory Manager, which cuDF uses under the hood).

To do that, you will need to import RMM, pass rmm_managed_memory=True when starting the cluster, and set CuPy's allocator to use RMM. Note that cuDF's default allocator is already RMM, so once rmm_managed_memory=True is set in LocalCUDACluster, cuDF implicitly uses managed memory; CuPy does not, which is why its allocator must be set explicitly.
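
In isolation, those changes look roughly like this (a minimal sketch; the complete code follows below):

import cupy as cp
import rmm
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Workers allocate GPU memory through RMM with managed (unified) memory.
cluster = LocalCUDACluster('0', rmm_managed_memory=True)
client = Client(cluster)
# Point CuPy at RMM on every worker; cuDF already uses RMM by default.
client.run(cp.cuda.set_allocator, rmm.rmm_cupy_allocator)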

Also note that you are calling compute() at the end. That call brings the data back to the client; in other words, it transfers a copy from the Dask GPU cluster back to the client's GPU. A more appropriate way of executing work in Dask is to use persist(), which computes the result but keeps it on the cluster for further consumption. If you do bring the data back to your client then, since you're using GPU 0 only, the Dask GPU worker and the Dask client will be competing for the same GPU's memory, which may eventually cause out-of-memory errors. What you can do in that case is also set the client process to use managed memory with RMM and set its CuPy allocator accordingly.

The complete code should look like the following:

import cupy as cp
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import rmm

if __name__ == '__main__':

    cluster = LocalCUDACluster('0', rmm_managed_memory=True)
    client = Client(cluster)
    client.run(cp.cuda.set_allocator, rmm.rmm_cupy_allocator)

    # Here we set RMM/CuPy memory allocator on the "current" process,
    # i.e., the Dask client.
    rmm.reinitialize(managed_memory=True)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

    shape = (512, 512, 30000)
    chunks = (100, 100, 1000)

    huge_array_gpu = da.ones_like(cp.array(()), shape=shape, chunks=chunks)
    array_sum = da.multiply(huge_array_gpu, 17).persist()
    # `persist()` starts the computation asynchronously on the cluster, so we
    # `wait()` for it to finish before doing anything with the result.
    wait(array_sum)

    # Bring data back to client if necessary
    # array_sum.compute()
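
If you want to verify that the configuration took effect, a quick check along these lines should work (a sketch; rmm.is_initialized is assumed to be available in the RMM version you have installed):

# Sanity check: confirm RMM has been initialized on each worker and on the client.
print(client.run(rmm.is_initialized))  # one result per worker
print(rmm.is_initialized())            # client process
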
pentschev
  • Thanks a lot for your input. That's awesome and it works! I can run it and I see the swaps happening. Does the computation actually take place when calling wait()? I tried printing the array after executing, and it just prints the shape and chunks. – JOKKINATOR Oct 17 '22 at 16:23
  • Calling `wait()` does compute, you should be able to see GPU utilization with `nvidia-smi`, but the results are kept on the worker. To visualize the result you would eventually need to call `compute()` or something similar, depending on what you need to do with the results. Not bringing data back to the client implicitly is by design, as the operation of bringing the result back can be costly (due to GPU, network bandwidth, etc.) as well as consume too much memory depending on the size of the result, and usually should be avoided, especially for intermediate results. – pentschev Oct 17 '22 at 20:30
  • Thanks a lot, @pentschev. One thing I have noticed is that I cannot run this program twice in a row; however, if I delete dask-worker-space/storage, I can run it again immediately. Is this a known bug? I get Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join: Error when I run the program twice in a row, but if I erase that folder it runs again. – JOKKINATOR Oct 18 '22 at 07:39
  • It's not a known bug as far as I'm aware, would you mind opening an issue for that in https://github.com/rapidsai/dask-cuda/issues with the details? – pentschev Oct 18 '22 at 08:00
  • Of course, on it. – JOKKINATOR Oct 18 '22 at 08:03