I'm trying Dask just for the fun of it and to grasp good practice. After some trial and error, I got the hang of Dask Array. Now with Dask DataFrame, I don't seem to be able to extend the DataFrame in a balanced, distributed way.
Here's an example. I'm doing a dummy test on my laptop with small data, using 8 workers (1 per CPU core, with 2 GB of RAM each).
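For reference, I believe that worker layout corresponds roughly to the explicit configuration below (a sketch only: the n_workers / threads_per_worker / memory_limit arguments are my guess at an equivalent; my actual script further down just calls Client() and takes the defaults).
from dask.distributed import Client

# Sketch: explicit equivalent of the default local cluster on this machine,
# i.e. 8 single-threaded workers with a 2 GB memory limit each.
client = Client(n_workers=8, threads_per_worker=1, memory_limit='2GB')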
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
client = Client()
df = pd.read_csv('dataset.csv',
                 sep=';',
                 index_col=0,
                 parse_dates=[0],
                 infer_datetime_format=True)
# The index is a datetime index.
ddf = dd.from_pandas(df, npartitions=8)
del df
ddf = ddf.persist()
# My workers are each equally filled, with about 76 MB.
# I'm now adding new columns, derived from date attributes.
ddf['date'] = ddf.index
ddf["yearly"] = ddf['date'].dt.week
ddf["weekly"] = ddf['date'].dt.weekday
ddf["daily"] = ddf['date'].dt.time.astype('str')
ddf = ddf.drop(labels=['date'], axis=1)
ddf = ddf.persist()
# OK, now one of the workers holds 130 MB; the others took no load.
# Let's make more columns with some one-hot encoding of those attributes
# (an alternative sketch follows after this code block).
dum = list()
dum.append(ddf['yearly'].unique().compute())
dum.append(ddf['weekly'].unique().compute())
dum.append(ddf['daily'].unique().compute())
for e in dum:
    for i in e[1:].index:
        ddf['{}_{}'.format(e.name, i)] = (ddf[e.name] == e[i]).astype('int')
ddf = ddf.persist()
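For comparison, I think the same one-hot step could also be kept lazy and partition-wise by turning the three columns into known categoricals and using Dask's get_dummies. This is only a sketch (ddf_ohe is just a name I made up here), and unlike my loop it keeps every category instead of dropping the first one.
# Sketch: categorize() scans the data once so the categories become known,
# then dd.get_dummies builds the indicator columns lazily, partition by partition.
cat = ddf.categorize(columns=['yearly', 'weekly', 'daily'])
ddf_ohe = dd.get_dummies(cat, columns=['yearly', 'weekly', 'daily'])
ddf_ohe = ddf_ohe.persist()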
With the last persist(), I can observe that only one worker shows CPU activity and that its memory keeps growing (up to 337 MiB).
It seems to me that the columns I create are derived row-wise from the partitioned original index. I was expecting each partition to grow equally, each using its own worker's allocated memory. Am I missing something here, or is this a limitation of Dask DataFrame?
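In case my way of measuring is part of the problem, here is roughly how I'm checking the balance besides watching the dashboard (a small diagnostic sketch; both calls are standard dask/distributed APIs, the prints are just mine):
# Row count per partition -- these should be roughly equal after from_pandas(..., npartitions=8).
print(ddf.map_partitions(len).compute())

# Which worker holds which persisted pieces of data.
print(client.who_has())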