I have a Dask cluster spread across many worker nodes. I also have an S3 bucket containing a large number of parquet files (right now 500k files, which might grow to three times that in the future).
The data in the parquet files is mostly text, with the columns [username, first_name, last_name, email, email_domain].
I want to load them, reshuffle them, and store the new partitions. Since I want to run grouped operations on email_domain, I'd like to write a new parquet file per email_domain.
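To make the schema and the target layout concrete (the rows and domains below are made up):

import pandas as pd

# illustrative rows with the schema described above
sample = pd.DataFrame({
    'username': ['jdoe', 'asmith'],
    'first_name': ['John', 'Alice'],
    'last_name': ['Doe', 'Smith'],
    'email': ['jdoe@gmail.com', 'asmith@yahoo.com'],
    'email_domain': ['gmail.com', 'yahoo.com'],
})

# goal: every distinct email_domain ends up in its own parquet file, e.g.
#   s3://bucket/gmail.com.parquet
#   s3://bucket/yahoo.com.parquet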
Currently I use from_delayed and groupby, but the resulting DAG has a shuffle-split layer of size n**2 (with roughly one partition per input file, that is on the order of 500k**2 ≈ 2.5e11 graph entries), and this does not fit in my scheduler's memory.
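For reference, the DataFrame is built roughly like this (the file listing and the read_one helper below are simplified placeholders, not my exact code):

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def read_one(path):
    # one delayed pandas read per parquet file in the bucket
    return pd.read_parquet(path)

# in reality this is the full ~500k-key listing of the bucket
paths = ['s3://bucket/part-000.parquet', 's3://bucket/part-001.parquet']

meta = pd.DataFrame(columns=['username', 'first_name', 'last_name',
                             'email', 'email_domain'], dtype='object')
df = dd.from_delayed([read_one(p) for p in paths], meta=meta)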
The groupby/store step then looks something like this:
def store(x):
    # x is the pandas DataFrame for a single email_domain group;
    # x.name is the group key (the domain itself)
    path = f's3://bucket/{x.name}.parquet'
    x.to_parquet(path)
    return path

z = df.groupby('email_domain').apply(store, meta=('email_domain', 'object'))
dask.visualize(z)
z.compute()