I have a few thousand CSV files in S3, and I want to load them, concatenate them together into a single pandas dataframe, and share that entire dataframe with all dask workers on a cluster. All of the files are approximately the same size (~1MB). I am using 8 processes per machine (one per core) and one thread per process. The entire dataframe fits comfortably into each worker process's memory. What is the most efficient and scalable way to accomplish this?
I implemented this workflow using MPI4py as follows: use a thread pool in one worker process to read all of the files into pandas dataframes, concatenate the dataframes together, and use MPI4py's broadcast function to send the complete dataframe to all of the other worker processes.
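Roughly, the MPI4py version looks like the following (a minimal sketch, not my exact code; the bucket, prefix, and the use of s3fs for the reads are illustrative assumptions):

```python
# Sketch of the MPI4py workflow: rank 0 reads and concatenates, then broadcasts.
# The bucket/prefix and s3fs-based reads are placeholders, not my actual setup.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import s3fs
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.rank == 0:
    # Rank 0 lists the CSV files and reads them with a thread pool.
    fs = s3fs.S3FileSystem()
    paths = fs.glob("my-bucket/my-prefix/*.csv")  # hypothetical location

    def read_one(path):
        with fs.open(path, "rb") as f:
            return pd.read_csv(f)

    with ThreadPoolExecutor(max_workers=16) as pool:
        frames = list(pool.map(read_one, paths))
    df = pd.concat(frames, ignore_index=True)
else:
    df = None

# Broadcast the complete dataframe (pickled under the hood) to every rank.
df = comm.bcast(df, root=0)
```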
I've thought of five ways to accomplish this in dask:
1. Each worker reads all of the files with pandas.read_csv and concatenates them with pandas.concat.
2. Use dask.dataframe.from_delayed to read all of the files into a distributed dask dataframe, use dask.distributed.worker_client to get a client inside each worker process, and then call dask.dataframe.compute on each worker to obtain the pandas dataframe.
3. Load the distributed dataframe as in solution 2, use dask.distributed.Client.replicate to copy every partition to every worker, then use dask.distributed.worker_client to get a client inside each worker process and call dask.dataframe.compute there to obtain the pandas dataframe.
4. Load the distributed dataframe as in solution 2, call dask.dataframe.compute to bring the dataframe into the local process, delete the distributed dataframe from the cluster (by cancelling the futures), and use dask.distributed.Client.scatter(broadcast=True, direct=True) to send the local pandas dataframe to all workers (see the sketch after this list).
5. Load the distributed dataframe and gather it to the local process as in solution 4, use dask.distributed.Client.scatter(broadcast=False) to send it to a single worker process, then use dask.distributed.Client.replicate to copy it to all of the other workers.
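For concreteness, here is a rough sketch of solution 4. The scheduler address, bucket, and prefix are placeholders, and I'm showing the loading step as pandas.read_csv wrapped in dask.delayed; since nothing is persisted here, there are no futures to cancel explicitly:

```python
# Sketch of solution 4: load a distributed dataframe, gather it locally,
# then broadcast the local pandas dataframe to every worker.
import dask
import dask.dataframe as dd
import pandas as pd
import s3fs
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# List the CSV files; the bucket and prefix are placeholders.
fs = s3fs.S3FileSystem()
paths = ["s3://" + p for p in fs.glob("my-bucket/my-prefix/*.csv")]

# Build the distributed dataframe from delayed pandas reads (the loading
# step shared by solutions 2-5).
parts = [dask.delayed(pd.read_csv)(p) for p in paths]
ddf = dd.from_delayed(parts)

# Gather the entire dataframe into the local process.
df = ddf.compute()

# Drop the distributed dataframe, then broadcast the local copy directly
# to every worker so each one holds the full dataframe.
del ddf
[df_future] = client.scatter([df], broadcast=True, direct=True)
```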
Solutions 2-5 have a huge advantage over the MPI4py version in that they leverage dask's ability to load the dataframe in parallel. However, none of them comes anywhere close to the performance of MPI4py's broadcast when it's time to distribute the data around the cluster. In addition, I'm having trouble predicting their memory usage, and I see many messages from the workers complaining that the event loop was unresponsive for multiple seconds.
At this stage, I'm inclined to go with the first solution: even though the data loading is inefficient, it's not that slow, and in my experience it's the most robust. But surely I'd be leaving a lot of performance on the table if I went that route. Is there any way to improve one of the dask solutions? Or is there another solution that I haven't considered?