Why is the compute() method slow for Dask dataframes but the head() method is fast?

Question

So I'm a newbie when it comes to working with big data. I'm dealing with a 60GB CSV file so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me, I just need a little push in the right direction...

So I understand why the following query using the "compute" method would be slow(lazy computation):

df.loc[1:5 ,'enaging_user_following_count'].compute()

btw it took 10 minutes to compute.

But what I don't understand is why the following code using the "head" method prints the output in less than two seconds (i.e., In this case, I'm asking for 250 rows while the previous code snippet was just for 5 rows):

df.loc[50:300 , 'enaging_user_following_count'].head(250)

Why doesn't the "head" method take a long time? I feel like I'm missing something here because I'm able to pull a huge number of rows in a way shorter time than when using the "compute" method. Or is the compute method used in other situations?

Note: I tried to read the documentation but there was no explanation to why head() is fast.

To my eyes, it seems like the second example only processes at most 300 rows where as the first processes the entire dataset — juanpa.arrivillaga, Jul 01 '20 at 02:40
@juanpa.arrivillaga Really? am I doing something wrong? In the first example, I specified "1:5"... How am I processing the entire dataset? — Faisal, Jul 01 '20 at 02:45

score 2 · Answer 1 · answered Jul 01 '20 at 14:37

This processes the whole dataset

df.loc[1:5, 'enaging_user_following_count'].compute()

The reason is, that loc is a label-based selector, and there is no telling what labels exist in which partition (there's no reason that they should be monotonically increasing). In the case that the index is well-formed, you may have useful values of df.divisions, and in this case Dask should be able to pick only the partitions of your data that you need.

score 2 · Answer 2 · answered Jul 01 '20 at 14:59

I played around with this a bit half a year ago. .head() is not checking all your partitions but simply checks the first partition. There is no synchronization overhead etc. so it is quite fast, but it does not take the whole Dataset into account.

You could try

df.loc[-251: , 'enaging_user_following_count'].head(250)

IIRC you should get the last 250 entries of the first partition instead of the actual last indices.

Also if you try something like

df.loc[conditionThatIsOnlyFulfilledOnPartition3 , 'enaging_user_following_count'].head(250)

you get an error that head could not find 250 samples.

If you actually just want the first few entries however it is quite fast :)

Why is the compute() method slow for Dask dataframes but the head() method is fast?

2 Answers2