I have a dask dataframe created from delayed functions, made up of randomly sized partitions. I would like to repartition it into partitions of approximately 10,000 rows each.
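For context, here is a minimal sketch of the setup (the loader function and the sizes are made up for illustration):

```python
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

@dask.delayed
def load_chunk(seed):
    # stand-in for the real loader; each call yields a randomly sized frame
    rng = np.random.default_rng(seed)
    n = int(rng.integers(1_000, 50_000))
    return pd.DataFrame({"x": rng.random(n)})

parts = [load_chunk(i) for i in range(20)]
# meta given explicitly so dask doesn't have to evaluate a partition up front
df = dd.from_delayed(parts, meta={"x": "f8"})
```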
I can calculate the required number of partitions with `np.ceil(df.size / 10000)`, but that seems to immediately compute the result?
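In other words, roughly this (a sketch of what I'm doing now): `df.size` itself is lazy, but `repartition` wants a concrete integer, so as far as I can tell I have to compute it, which runs all the delayed loads:

```python
# the .compute() here appears to trigger every delayed load just to get a count
npartitions = int(np.ceil(df.size.compute() / 10_000))
df = df.repartition(npartitions=npartitions)
```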
If I understand correctly, computing that value would require reading all of the underlying dataframes into memory, which would be very inefficient. I would instead like to specify the whole operation as a dask graph to be submitted to the distributed scheduler, so that no calculations are done locally.
Is there some way to specify `npartitions` without immediately computing all the underlying delayed functions?
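Conceptually, what I'm hoping for is something like the sketch below, where the partition count stays part of the graph. This is purely illustrative; I don't know whether `repartition` (or anything else) accepts a lazy value here:

```python
# pseudo-code: I don't know of an API that takes a lazy npartitions
lazy_npartitions = df.size // 10_000 + 1            # still a lazy dask Scalar
df = df.repartition(npartitions=lazy_npartitions)   # <- is anything like this possible?
```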