
I am attempting to use Dask to work on a larger-than-memory dataset on my laptop through a Jupyter notebook. The data is stored as many CSV files in an Amazon S3 bucket.

This first cell runs quickly and I can view the Dask dashboard on port 8787 as expected.

from dask.distributed import Client
import dask.dataframe as dd
client = Client()
client

This second cell executes in 55.1 s, which seems odd to me since it isn't actually pulling any data.

df = dd.read_csv('s3://*/*/*.csv', assume_missing=True)

This third cell hangs for 11 minutes before I see anything in the Task Stream in the dashboard, but then it works as expected, executing in 13m 3s total.

df['timestamp'] = dd.to_datetime(df['timestamp'], unit='ms')
df = df.set_index('timestamp')
df = client.persist(df)

A picture of my dashboard

This seems similar in spirit to "Dask Distributed client takes to long to initialize in jupyter lab", but my client starts fine, and everything does work eventually. Am I missing something obvious? Thanks!

James McKeown

1 Answer


You could, of course, run a profiler to find out exactly what is taking the time. Even the scheduler has profiling information, though it is not as accessible.

Chances are, the time is being taken scanning the file information for the many, many files on S3. Dask must list all of these files to find out how big they are and assign the blocks to read, which takes many slow HTTP calls.
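
If you want to confirm that, you could time the listing step on its own with s3fs, the library Dask uses under the hood for S3 access. This is just a sketch, not something from the question: the bucket and prefix are placeholders for your own.

import time
import s3fs

# List every CSV matching the same glob pattern Dask would expand.
fs = s3fs.S3FileSystem()  # pass credentials here if you need them
start = time.time()
files = fs.glob('my-bucket/*/*.csv')  # hypothetical bucket/prefix
print(len(files), 'files listed in', time.time() - start, 'seconds')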

That, in turn, produces a very large number of tasks, as you have found. The whole graph of tasks must be serialised and sent to the scheduler in order to be executed, then handled by the scheduler and sent out to the workers. The bigger the graph, the more these costs add up.
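
As a rough check of how big that graph is, you can inspect the dataframe before persisting it; npartitions and the collection's graph are standard Dask attributes, but the exact numbers will depend on your data.

# Each partition contributes at least one task; set_index adds many more.
print('partitions:', df.npartitions)

# Approximate number of tasks that must be serialised and shipped to the scheduler.
print('tasks in graph:', len(df.__dask_graph__()))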

The short story is: if you want to optimise for data throughput, you would do well to partition your incoming data into much bigger chunks. You will see recommendations of chunk sizes on the order of 100 MB.
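
One way to get there, assuming you can afford a one-off conversion pass, is to read the small CSVs once, repartition into roughly 100 MB chunks, and write the result back out as Parquet. The S3 paths below are placeholders, and to_parquet needs pyarrow or fastparquet installed.

df = dd.read_csv('s3://my-bucket/*/*.csv', assume_missing=True)  # hypothetical path

# Coalesce the many small input partitions into ~100MB chunks,
# then store them in a columnar format that is cheap to re-read.
df = df.repartition(partition_size='100MB')
df.to_parquet('s3://my-bucket/data-parquet/')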

mdurant
  • I suspected that the partition scheme wasn't optimal but I guess I was hoping it wouldn't impede reading that much. In the past, I've restricted my scope to one month of data, used `for` loops to read in a bunch of pandas dataframes, and then concatenated them. This finished before I was done getting coffee so I didn't mind. What would you recommend for profiling? I suspect `cProfile.run('client.persist(df)')` isn't the right thing, but I'm a total profiling n00b here. – James McKeown Jan 16 '19 at 22:20
  • I usually start with snakeviz, but there are plenty of tools out there – mdurant Jan 16 '19 at 23:03
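
For instance, snakeviz ships an IPython extension, so a minimal profiling sketch in the notebook (assuming snakeviz is installed) might look like the following. Note that this only captures client-side time such as graph construction and serialisation, since the actual work runs on the workers.

# Load the snakeviz magic and profile the client-side part of the call.
%load_ext snakeviz
%snakeviz client.persist(df)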