I'm trying Dask just for the fun of it and to grasp good practice. After some trial and error, I got the hang of Dask Array. Now with Dask DataFrame, I don't seem to be able to extend the DataFrame in a balanced, distributed way.
Here's an example. I'm doing a dummy test on my laptop with small data, using 8 workers (1 per CPU core, with 2 GB of RAM each).
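For reference, I believe that worker layout corresponds roughly to the explicit configuration below (a sketch only: the n_workers / threads_per_worker / memory_limit arguments are my guess at an equivalent; my actual script further down just calls Client() and takes the defaults).
from dask.distributed import Client

# Sketch: explicit equivalent of the default local cluster on this machine,
# i.e. 8 single-threaded workers with a 2 GB memory limit each.
client = Client(n_workers=8, threads_per_worker=1, memory_limit='2GB')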
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
client = Client()
df = pd.read_csv('dataset.csv',
                 sep=';',
                 index_col=0,
                 parse_dates=[0],
                 infer_datetime_format=True)
# The index is a datetime index.
ddf = dd.from_pandas(df, npartitions=8)
del df
ddf = ddf.persist()
# My workers are each equally filled, with about 76 MB.
# I'm now adding new columns, derived from date attributes.
ddf['date'] = ddf.index
ddf["yearly"] = ddf['date'].dt.week
ddf["weekly"] = ddf['date'].dt.weekday
ddf["daily"] = ddf['date'].dt.time.astype('str')
ddf = ddf.drop(labels=['date'], axis=1)
ddf = ddf.persist()
# OK, now one of the workers holds 130 MB; the others took no load.
# Let's make more columns with some one-hot encoding of those attributes
# (an alternative sketch follows after this code block).
dum = list()
dum.append(ddf['yearly'].unique().compute())
dum.append(ddf['weekly'].unique().compute())
dum.append(ddf['daily'].unique().compute())
for e in dum:
    for i in e[1:].index:
        ddf['{}_{}'.format(e.name, i)] = (ddf[e.name] == e[i]).astype('int')
ddf = ddf.persist()
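For comparison, I think the same one-hot step could also be kept lazy and partition-wise by turning the three columns into known categoricals and using Dask's get_dummies. This is only a sketch (ddf_ohe is just a name I made up here), and unlike my loop it keeps every category instead of dropping the first one.
# Sketch: categorize() scans the data once so the categories become known,
# then dd.get_dummies builds the indicator columns lazily, partition by partition.
cat = ddf.categorize(columns=['yearly', 'weekly', 'daily'])
ddf_ohe = dd.get_dummies(cat, columns=['yearly', 'weekly', 'daily'])
ddf_ohe = ddf_ohe.persist()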
With the last persist(), I can observe that only one worker shows CPU activity and that its memory keeps growing (up to 337 MiB).
It seems to me that the columns I create are derived row-wise from the partitioned original index. I was expecting each partition to grow equally, each using its own worker's allocated memory. Am I missing something here, or is this a limitation of Dask DataFrame?
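In case my way of measuring is part of the problem, here is roughly how I'm checking the balance besides watching the dashboard (a small diagnostic sketch; both calls are standard dask/distributed APIs, the prints are just mine):
# Row count per partition -- these should be roughly equal after from_pandas(..., npartitions=8).
print(ddf.map_partitions(len).compute())

# Which worker holds which persisted pieces of data.
print(client.who_has())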