How much overhead is there per partition when loading dask_cudf partitions into GPU memory?

Question

PCIE bus bandwidth latencies force constraints on how and when applications should copy data to and from GPUs.

When working with cuDF directly, I can efficiently move a single large chunk of data into a single DataFrame.

When using dask_cudf to partition my DataFrames, does Dask copy partitions into GPU memory one at a time? In batches? If so, is there significant overhead from multiple copy operations instead of a single larger copy?

score 1 · Answer 1 · answered Feb 20 '19 at 00:36

This probably depends on the scheduler you're using. As of 2019-02-19 dask-cudf uses the single-threaded scheduler by default (cudf segfaulted for a while there if used in multiple threads), so any transfers would be sequential if you're not using some dask.distributed cluster. If you're using a dask.distributed cluster, then presumably this would happen across each of your GPUs concurrently.

It's worth noting that dask.dataframe + cudf doesn't do anything special on top of what cudf would do. It's as though you called many cudf calls in a for loop, or in one for-loop per GPU, depending on the scheduler choice above.

Disclaimer: cudf and dask-cudf are in heavy flux. Future readers probably should check with current documentation before trusting this answer.

How much overhead is there per partition when loading dask_cudf partitions into GPU memory?

1 Answers1