
I have a dataframe that consists of 5 million records. I am trying to process it with the code below, leveraging Dask dataframes in Python:

    import dask.dataframe as dd

    dask_df = dd.read_csv(fullPath)
    ............
    for index, row in uniqueURLs.iterrows():
        print(index)
        results = dask_df[dask_df['URL'] == row['URL']]
        count = results.size.compute()

But I noticed that Dask is very efficient at filtering dataframes BUT NOT at .compute(). If I remove the line that computes the size of results, my program becomes very fast. Can someone explain this? How can I make it faster?


1 Answer


But I noticed that Dask is very efficient at filtering dataframes BUT NOT at .compute().

You are misunderstanding how dask.dataframe works. The line results = dask_df[dask_df['URL'] == row['URL']] performs no computation on the dataset. It merely stores instructions for computations that can be triggered at a later point.

All computations are applied only with the line count = results.size.compute(). This is entirely expected, as dask works lazily.
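
As a minimal sketch of this behavior, assume a small CSV at 'data.csv' with a 'URL' column (the path, column name, and filter value below are placeholders, not taken from the question):

    import dask.dataframe as dd

    ddf = dd.read_csv('data.csv')                        # lazy: builds a task graph, no full read yet
    filtered = ddf[ddf['URL'] == 'http://example.com']   # lazy: just records the filter step
    print(type(filtered))                                # a Dask DataFrame, not a computed result

    count = filtered.size.compute()                      # only now is the CSV actually read and filtered
    print(count)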

Think of a generator and a function such as list, which can exhaust it. The generator itself is lazy, but triggers operations when consumed by such a function. dask.dataframe is also lazy, but works smartly by forming an internal "chain" of sequential operations.
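
For instance, a plain Python generator shows the same pattern:

    gen = (x * x for x in range(10))   # lazy: no squares have been computed yet
    squares = list(gen)                # list() exhausts the generator and does the work now
    print(squares)

    # In the same way, dask_df[dask_df['URL'] == ...] only extends the task graph,
    # and results.size.compute() is the call that finally runs it.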

See Laziness and Computing in the docs for more information.
