
I have a Dask cluster spread across many worker nodes. I also have an S3 bucket with many parquet files (right now 500k files; it might be three times that size in the future).

The data in the parquet files is mostly text: [username, first_name, last_name, email, email_domain]

I want to load them up, reshuffle them, and store the new partitions. I want to be able to group operations based on email_domain, so I'd like to write a new parquet file for each email_domain.

Currently I use from_delayed and groupby, but the resulting DAG has a shuffle-split layer of size n**2, which does not fit in my scheduler's memory. Something along these lines:

# df is a dask DataFrame built with dd.from_delayed over the parquet files
def store(x):
    # x is the pandas DataFrame for a single email_domain group; x.name is the group key
    path = f's3://bucket/{x.name}.parquet'
    x.to_parquet(path)
    return path

z = df.groupby('email_domain').apply(store, meta=('email_domain', 'object'))
z.visualize()
z.compute()
t_z
1 Answer


Yes, groupby-apply is expensive, especially in parallel.

I would expect things to still work, just slowly.
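As a side note (this is a sketch, not part of the answer above), dask.dataframe can write Hive-style partitioned output directly via partition_on, which avoids the groupby-apply entirely. The bucket paths below are placeholders, and this produces one directory per email_domain value (each Dask partition contributes its own file to a directory, so a directory may contain several files rather than exactly one):

import dask.dataframe as dd

# Placeholder input path; adjust to the real bucket layout
df = dd.read_parquet('s3://bucket/input/')

# Writes directories like s3://bucket/output/email_domain=example.com/part.N.parquet
df.to_parquet(
    's3://bucket/output/',
    partition_on=['email_domain'],
    write_index=False,
)

Shuffling or setting the index on email_domain first would reduce the number of files per directory, but that reintroduces the expensive shuffle described in the question.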

MRocklin