
I have a dataframe that consists of 5 million records. I am trying to process it with the code below, leveraging Dask dataframes in Python:

    import dask.dataframe as dd

    dask_df = dd.read_csv(fullPath)
    ............
    for index, row in uniqueURLs.iterrows():
        print(index)
        results = dask_df[dask_df['URL'] == row['URL']]
        count = results.size.compute()

But I noticed that Dask is very efficient at filtering dataframes BUT NOT at .compute(). If I remove the line that computes the size of results, my program becomes very fast. Can someone explain this? How can I make it faster?


1 Answer


But I noticed that Dask is very efficient at filtering dataframes BUT NOT at .compute().

You are misunderstanding how dask.dataframe works. The line results = dask_df[dask_df['URL'] == row['URL']] performs no computation on the dataset. It merely stores instructions for computations that can be triggered at a later point.

All computations are applied only with the line count = results.size.compute(). This is entirely expected, as dask works lazily.
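
As a minimal sketch of this behavior, assume a small CSV at 'data.csv' with a 'URL' column (the path, column name, and filter value below are placeholders, not taken from the question):

    import dask.dataframe as dd

    ddf = dd.read_csv('data.csv')                        # lazy: builds a task graph, no full read yet
    filtered = ddf[ddf['URL'] == 'http://example.com']   # lazy: just records the filter step
    print(type(filtered))                                # a Dask DataFrame, not a computed result

    count = filtered.size.compute()                      # only now is the CSV actually read and filtered
    print(count)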

Think of a generator and a function such as list, which can exhaust it. The generator itself is lazy, but triggers operations when consumed by such a function. dask.dataframe is also lazy, but works smartly by forming an internal "chain" of sequential operations.
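
For instance, a plain Python generator shows the same pattern:

    gen = (x * x for x in range(10))   # lazy: no squares have been computed yet
    squares = list(gen)                # list() exhausts the generator and does the work now
    print(squares)

    # In the same way, dask_df[dask_df['URL'] == ...] only extends the task graph,
    # and results.size.compute() is the call that finally runs it.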

See Laziness and Computing in the docs for more information.
