After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of partitions), and that depends on the size of the data after processing, which is not yet known.
I think I can compute the size lazily with df.memory_usage().sum(), but repartition() does not seem to accept the resulting (lazy) scalar as an argument.
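Roughly what I mean, as a sketch (a toy df stands in for my real pipeline, and the ~100 MB target is arbitrary):

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for the real pipeline: any lazy dask DataFrame
df = dd.from_pandas(pd.DataFrame({"x": range(1_000_000)}), npartitions=8)

TARGET_BYTES = 100 * 1024**2  # arbitrary target of ~100 MB per partition

size = df.memory_usage(deep=True).sum()  # lazy scalar
# df = df.repartition(npartitions=size // TARGET_BYTES)  # what I'd like, but npartitions must be a plain int

# The only way I see is to force the computation first, which breaks laziness:
npartitions = max(1, int(size.compute() // TARGET_BYTES) + 1)
df = df.repartition(npartitions=npartitions)
```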
Is there a way to do this kind of adaptive (data-size-based) lazy repartitioning?
PS. Since this is the (almost) last step in my pipeline, I can probably work around this by converting to delayed and repartitioning "manually" (I don't need to go back to a dataframe), but I'm looking for a simpler way to do this.
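Something along these lines is what I have in mind (a sketch only: the output paths and the ~100 MB target are placeholders, and it pushes all partitions through a single task, so parallelism is lost):

```python
import pandas as pd
from dask import delayed

TARGET_BYTES = 100 * 1024**2  # placeholder target size per output file


@delayed
def write_by_size(frames, prefix):
    # At run time the partitions are concrete pandas DataFrames,
    # so their sizes are known and the grouping can be decided here.
    buffer, buffered_bytes, chunk_id = [], 0, 0
    for frame in frames:
        buffer.append(frame)
        buffered_bytes += frame.memory_usage(deep=True).sum()
        if buffered_bytes >= TARGET_BYTES:
            pd.concat(buffer).to_parquet(f"{prefix}-{chunk_id}.parquet")
            buffer, buffered_bytes, chunk_id = [], 0, chunk_id + 1
    if buffer:
        pd.concat(buffer).to_parquet(f"{prefix}-{chunk_id}.parquet")


# df.to_delayed() gives one delayed pandas DataFrame per partition
write_by_size(df.to_delayed(), "out").compute()
```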
PPS. Repartitioning by partition size would also be a very useful feature.