I have a dask dataframe created from delayed functions, made up of randomly sized partitions. I would like to repartition it into partitions of approximately 10,000 rows each.
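For context, here is a minimal sketch of the setup (the loader function and the sizes are made up for illustration):

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

@dask.delayed
def load_chunk(seed):
    # stand-in for the real loader; each call yields a randomly sized frame
    rng = np.random.default_rng(seed)
    n = int(rng.integers(1_000, 50_000))
    return pd.DataFrame({"x": rng.random(n)})

parts = [load_chunk(i) for i in range(20)]
# meta given explicitly so dask doesn't have to evaluate a partition up front
df = dd.from_delayed(parts, meta={"x": "f8"})
```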
I can calculate the required number of partitions with `np.ceil(df.size / 10000)`, but that seems to immediately compute the result?
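In other words, roughly this (a sketch of what I'm doing now): `df.size` itself is lazy, but `repartition` wants a concrete integer, so as far as I can tell I have to compute it, which runs all the delayed loads:

```python
# the .compute() here appears to trigger every delayed load just to get a count
npartitions = int(np.ceil(df.size.compute() / 10_000))
df = df.repartition(npartitions=npartitions)
```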
If I understand correctly, computing that value would require reading all of the underlying dataframes into memory, which would be very inefficient. I would instead like to specify the whole operation as a dask graph to be submitted to the distributed scheduler, so that no calculations are done locally.
Is there some way to specify `npartitions` without immediately computing all the underlying delayed functions?
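Conceptually, what I'm hoping for is something like the sketch below, where the partition count stays part of the graph. This is purely illustrative; I don't know whether `repartition` (or anything else) accepts a lazy value here:

```python
# pseudo-code: I don't know of an API that takes a lazy npartitions
lazy_npartitions = df.size // 10_000 + 1            # still a lazy dask Scalar
df = df.repartition(npartitions=lazy_npartitions)   # <- is anything like this possible?
```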