
I've been trying to solve memory management issues with dask_cudf in a recent project for quite some time, but I'm missing something and need your help. I am working on a Tesla T4 GPU with 15 GiB of memory. I have several ETL steps, and the GPU seems to be failing on most of them (most are just filtering or transformation steps, but a few involve shuffling). My data consists of around 20 parquet files of roughly 500 MB each. For this specific question I will provide a piece of code I use for filtering, which makes the GPU fail due to lack of memory.

I start by setting up a CUDA cluster:

import os

from dask.distributed import Client
from dask.utils import parse_bytes
from dask_cuda import LocalCUDACluster

CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0")

cluster = LocalCUDACluster(
    # rmm_pool_size=get_rmm_size(0.6 * device_mem_size()),
    CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
    local_directory=os.path.join(WORKING_DIR, "dask-space"),
    device_memory_limit=parse_bytes("12GB"),
)
client = Client(cluster)
client

Depending on whether I provide the rmm_pool_size parameter, the error is different. When the parameter is provided, I get that the maximum pool limit is exceeded; otherwise I get the following error: MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
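For reference, here is a rough sketch of how I understand the cluster could be configured with an explicit pool kept below the card's 15 GiB and a lower spill threshold; the 10 GB / 8 GB values are placeholders I picked for illustration, not recommended settings:

import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Sketch only: pool and spill sizes are illustrative placeholders.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=os.environ.get("CUDA_VISIBLE_DEVICES", "0"),
    rmm_pool_size="10GB",           # keep the pool below the 15 GiB of physical memory
    device_memory_limit="8GB",      # spill device->host before the pool fills up
    local_directory="dask-space",   # placeholder; my real code uses WORKING_DIR
)
client = Client(cluster)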

Next, I create the filtering operation I intend to perform on the data (it checks whether a value in a column appears in a set containing around 80,000 values):

def remove_invalid_values_filter_factory(valid_value_set_or_series):
    # Returns a function that keeps only the rows whose 'col' value
    # appears in the given set/series of valid values.
    def f(df):
        mask = df['col'].isin(valid_value_set_or_series)
        return df.loc[mask]
    return f

import pandas as pd

# Load the valid values from another file
valid_values_info_df = pd.read_csv(...)
# The series is around 1 MiB in size
keep_known_values_only = remove_invalid_values_filter_factory(valid_values_info_df['values'])
# Also tried passing a set instead of the series; both cause the error
# keep_known_values_only = remove_invalid_values_filter_factory(set(valid_values_info_df['values']))
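For completeness, I understand the same filter could also be expressed as an inner join against a small on-device table of the valid keys instead of capturing a host-side series in the closure; the sketch below is untested and the paths are placeholders:

import cudf
import dask_cudf

# Untested sketch: small on-device table holding the ~80,000 valid keys.
valid_keys = cudf.DataFrame({"col": valid_values_info_df["values"].unique()})

ddf = dask_cudf.read_parquet("data/*.parquet")            # placeholder path
# The inner join keeps only rows whose 'col' value appears in valid_keys.
filtered = ddf.merge(valid_keys, on="col", how="inner")
filtered.to_parquet("filtered/")                          # placeholder path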

Finally I apply this filter operation on the data and get the error:

%%time
# Error occurs during this processing step
keep_known_values_only(
    dask_cudf.read_parquet(...)
).to_parquet(...)

I feel totally lost; most sources I came across attribute this error to using cuDF without Dask or to not setting up a CUDA cluster, but I have both. Besides, the filtering operation intuitively shouldn't be memory-expensive, so I have no clue what to do. I assume there is something wrong with how I set up the cluster, and that fixing it would hopefully make the rest of the more memory-expensive operations work as well.

I would be grateful for your help, thanks!

Milos

1 Answer


I'd use dask-sql for this to take advantage of its ability to do out-of-core processing.
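Very roughly, the filter from the question might look like this with dask-sql; this is only a sketch (paths are placeholders, and the isin is expressed as a join), not tested code:

import cudf
import dask_cudf
from dask_sql import Context

c = Context()

# Register the parquet data and the small table of valid values as SQL tables.
c.create_table("data", dask_cudf.read_parquet("data/*.parquet"))   # placeholder path
c.create_table("valid", cudf.from_pandas(valid_values_info_df))

# Keep only rows whose 'col' value appears in the valid-values table.
result = c.sql('SELECT d.* FROM data AS d INNER JOIN valid AS v ON d.col = v."values"')
result.to_parquet("filtered/")                                     # placeholder path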

As for the dask_cudf functions failing, please file an issue in the cudf repo with a minimal reproducible example! We'd appreciate it! :)

You may not want to use dask_cudf and RMM together unless you really have to and know what you're doing (that's RAPIDS super-user mode, for when you need to really maximize the GPU memory available to an algorithm). If your use case calls for that, it can really help, but it doesn't seem to here, as you're just reading parquet files, which is why I'm not deep-diving into it.

TaureanDyerNV