
I have a Dask cluster spread across many worker nodes. I also have an S3 bucket with many parquet files (right now 500k files; it might be three times that size in the future).

The data in the parquet files is mostly text: [username, first_name, last_name, email, email_domain]

I want to load them up, reshuffle them, and store the new partitions. I want to be able to group operations based on email_domain, so I'd like to write a new parquet file for each email_domain.

Currently I use from_delayed and groupby, but the resulting DAG has a shuffle-split layer of size n**2, which does not fit in my scheduler's memory. Something along these lines:

# df is a dask DataFrame built with dd.from_delayed over the parquet files
def store(x):
    # x is the pandas DataFrame for a single email_domain group; x.name is the group key
    path = f's3://bucket/{x.name}.parquet'
    x.to_parquet(path)
    return path

z = df.groupby('email_domain').apply(store, meta=('email_domain', 'object'))
z.visualize()
z.compute()
t_z
1 Answer


Yes, groupby-apply is expensive, especially in parallel.

I would expect things to still work, just slowly.
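As a side note (this is a sketch, not part of the answer above), dask.dataframe can write Hive-style partitioned output directly via partition_on, which avoids the groupby-apply entirely. The bucket paths below are placeholders, and this produces one directory per email_domain value (each Dask partition contributes its own file to a directory, so a directory may contain several files rather than exactly one):

import dask.dataframe as dd

# Placeholder input path; adjust to the real bucket layout
df = dd.read_parquet('s3://bucket/input/')

# Writes directories like s3://bucket/output/email_domain=example.com/part.N.parquet
df.to_parquet(
    's3://bucket/output/',
    partition_on=['email_domain'],
    write_index=False,
)

Shuffling or setting the index on email_domain first would reduce the number of files per directory, but that reintroduces the expensive shuffle described in the question.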

MRocklin