I have a dataset stored in a tab-separated text file. The file looks as follows:
date time temperature
2010-01-01 12:00:00 10.0000
...
where the temperature column contains values in degrees Celsius (°C).
I compute the daily average temperature using Dask. Here is my code:
from dask.distributed import Client
import dask.dataframe as dd

client = Client("<scheduler URL>")

# Read the tab-separated file lazily and drop the unneeded time column.
inputDataFrame = dd.read_table("<input file>").drop('time', axis=1)

# Build the daily mean lazily; nothing executes until compute().
groupedData = inputDataFrame.groupby('date')
meanDataframe = groupedData.mean()

result = meanDataframe.compute()
result.to_csv('result.out', sep='\t')
client.close()
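As far as I understand, the partitioning and the task graph can already be inspected before compute() runs; here is a minimal sketch of what I mean (blocksize is my guess at the relevant knob, and visualize() needs graphviz installed):

import dask.dataframe as dd

# blocksize controls roughly how many bytes of the file go into each
# partition (my assumption that this is the relevant setting)
df = dd.read_table("<input file>", blocksize="64MB")
print(df.npartitions)                              # partitions created by read_table()
df.groupby('date').mean().visualize('graph.svg')   # task graph that compute() would run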
To improve the performance of my program, I would like to understand how Dask DataFrame operations move data around.
- How is the text file read into a data frame by read_table()? Does the client read the whole text file and send the data to the scheduler, which partitions it and sends it to the workers? Or does each worker read the partitions it works on directly from the text file?
- When an intermediate data frame is created (e.g. by calling drop()), is the whole intermediate data frame sent back to the client and then redistributed to the workers for further processing?
- The same question for groups: where is the data for a group object created and stored, and how does it flow between client, scheduler, and workers? (A sketch of how I would try to inspect this follows below.)
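To make the last point concrete, this is how I would try to check where the partitions actually live; I am assuming persist() plus Client.who_has() is the right way to ask the scheduler:

from dask.distributed import Client, futures_of
import dask.dataframe as dd

client = Client("<scheduler URL>")
df = dd.read_table("<input file>").drop('time', axis=1)

persisted = df.persist()    # materialize the partitions on the workers
# who_has() maps each partition key to the worker(s) holding it
print(client.who_has(futures_of(persisted)))
client.close()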
The reason for my question is that an equivalent program using Pandas runs roughly twice as fast, and I am trying to understand what causes the overhead in Dask. Since the result data frame is very small compared to the input data, I suspect that a significant part of the overhead comes from moving the input and intermediate data between client, scheduler, and workers.
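To narrow down where the time goes, I plan to time the same pipeline under Pandas and under Dask's local threaded scheduler, so that overhead specific to the distributed setup can be separated out; a sketch of what I have in mind:

import time
import pandas as pd
import dask.dataframe as dd

def timed(label, fn):
    # crude wall-clock timing, enough to compare orders of magnitude
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Pandas baseline: one process, no task graph, no network traffic.
timed("pandas", lambda: pd.read_table("<input file>")
                          .drop('time', axis=1)
                          .groupby('date').mean())

ddf = dd.read_table("<input file>").drop('time', axis=1)

# Same graph on the local threaded scheduler: no client/scheduler/worker hops.
timed("dask threads", lambda: ddf.groupby('date').mean()
                                 .compute(scheduler='threads'))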