
I have a dataset stored in a tab-separated text file. The file looks as follows:

date    time    temperature
2010-01-01  12:00:00    10.0000 
...

where the temperature column contains values in degrees Celsius (°C). I compute the daily average temperature using Dask. Here is my code:

from dask.distributed import Client
import dask.dataframe as dd

client = Client("<scheduler URL>")
inputDataFrame = dd.read_table("<input file>").drop('time', axis=1)
groupedData = inputDataFrame.groupby('date')
meanDataframe = groupedData.mean()
result = meanDataframe.compute()
result.to_csv('result.out', sep='\t')

client.close()

In order to improve the performance of my program, I would like to understand the data flow caused by Dask data frames.

  1. How is the text file read into a data frame by read_table()? Does the client read the whole text file and send the data to the scheduler, which partitions the data and sends it to the workers? Or does each worker read the data partitions it works on directly from the text file?
  2. When an intermediate data frame is created (e.g. by calling drop()), is the whole intermediate data frame sent back to the client and then redistributed to the workers for further processing?
  3. The same question applies to groups: where is the data for a group object created and stored? How does it flow between the client, the scheduler and the workers?

The reason for my question is that if I run a similar program using Pandas, the computation is roughly two times faster, and I am trying to understand what causes the overhead in Dask. Since the size of the result data frame is very small compared to the size of the input data, I suspect that a significant amount of overhead is caused by moving the input and intermediate data between the client, the scheduler and the workers.
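
A Pandas version of the same computation would look roughly like this (a sketch, using the same placeholder file name as above):

import pandas as pd

inputDataFrame = pd.read_table("<input file>").drop('time', axis=1)
result = inputDataFrame.groupby('date').mean()
result.to_csv('result.out', sep='\t')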

Giorgio

1 Answer


1) The data are read by the workers. The client does read a little ahead of time to figure out the column names and types and, optionally, to find line delimiters for splitting files. Note that all workers must be able to reach the file(s) of interest, which may require a shared file system when working on a cluster.
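
For example (a sketch, with a placeholder file name), the blocksize argument of read_table controls how the file is split into byte-range partitions, and each worker reads only the partitions assigned to it:

import dask.dataframe as dd

# Split the file into ~64 MB partitions; workers read their own byte ranges
# directly, so the raw data never passes through the client.
df = dd.read_table("<input file>", blocksize=64 * 2**20)
print(df.npartitions)  # how many partitions the file was split into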

2), 3) In fact, the drop, groupby and mean methods do not generate intermediate data-frames at all; they just accumulate a graph of operations to be executed (i.e., they are lazy). You could time these steps and see that they are fast. During execution, intermediates are created on the workers, copied to other workers as required, and discarded as soon as possible. Data is never copied to the scheduler or the client unless you explicitly request it.
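
You can check this yourself: building the graph takes milliseconds, while compute() does all the real work. Something like the following (same placeholder file name) illustrates it:

import time
import dask.dataframe as dd

df = dd.read_table("<input file>").drop('time', axis=1)

start = time.time()
meanDataframe = df.groupby('date').mean()   # only builds the task graph
print("graph construction:", time.time() - start)

start = time.time()
result = meanDataframe.compute()            # the data is actually read and aggregated here
print("compute:", time.time() - start)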

So, to the root of your question: you can best investigate the performance of your operation by looking at the dashboard.
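
If you are using the distributed scheduler, the dashboard address is available on the client object, and newer versions of distributed can also write a static profiling report to an HTML file; roughly:

import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client("<scheduler URL>")          # placeholder address, as in the question
print(client.dashboard_link)                # open this URL in a browser to watch tasks live

df = dd.read_table("<input file>").drop('time', axis=1)

# performance_report is only available in newer versions of distributed
with performance_report(filename="dask-report.html"):
    result = df.groupby('date').mean().compute()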

There are many factors that govern how quickly things will progress: the processes may be sharing an IO channel; some tasks do not release the GIL, and so parallelise poorly in threads; the number of groups will greatly affect the amount of shuffling of data into groups... plus there is always some overhead for every task executed by the scheduler.
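
For example, if your tasks hold the GIL, a local cluster with several single-threaded worker processes may parallelise better than one multi-threaded worker; the numbers below are only illustrative:

from dask.distributed import Client

# Four worker processes with one thread each, so GIL-bound tasks still run
# in parallel across processes (illustrative values, tune for your machine).
client = Client(n_workers=4, threads_per_worker=1)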

Since Pandas is efficient, it is not surprising that for the case where data fits easily into memory, it performs well compared to Dask.

mdurant
  • Thanks a lot. I am using the dashboard already, but the video on the page you linked in the answer seems quite useful. BTW, all the workers in my setup run on virtual machines and access a common `nfs` mount. – Giorgio Oct 13 '18 at 07:28