After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of the partitions), and that depends on the size of the data after processing, which is not yet known.

I think I can compute the size lazily with df.memory_usage().sum(), but repartition() does not seem to accept it (a lazy scalar) as an argument.
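
Roughly the shape of what I'm doing (a simplified sketch, not my real pipeline; the paths, columns, and the 100 MB target below are just placeholders):

```python
import dask.dataframe as dd

# Stand-in for the real pipeline: paths, columns and the size target are placeholders
df = dd.read_csv("input/*.csv")
df = df[df["value"] > 0]                 # several lazy stages in reality,
df = df.assign(doubled=df["value"] * 2)  # so the output size is unknown up front

# This is a lazy dask Scalar, not a concrete number:
total_bytes = df.memory_usage(deep=True).sum()

target_bytes = 100 * 1024 ** 2           # say, ~100 MB per partition

# What I would like to write, but repartition() needs a plain integer
# for npartitions, not a lazy scalar:
# df = df.repartition(npartitions=total_bytes // target_bytes)

df.to_parquet("output/")
```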

Is there a way to do this kind of adaptive (data-size-based) lazy repartitioning?

PS. Since this is the (almost) last step in my pipeline, I can probably work around this by converting to delayed and repartitioning "manually" (I don't need to go back to a dataframe), but I'm looking for a simpler way to do this.

PS. Repartitioning by partition size would also be a very useful feature.

evilkonrex

1 Answer

Unfortunately, Dask's task-graph construction happens immediately, and there is no way to repartition (or perform any operation) where the number of partitions is not immediately known or is computed lazily.

You could, as you suggest, switch to lower-level systems like delayed. In that case I would switch to using futures and track the size of results as they come in, triggering appropriate merging of partitions on the fly. This is probably far more complex than is desired, though.
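
As a rough, untested sketch of the delayed route (the 100 MB threshold, the output paths, and the add_partition/flush helpers below are made up for illustration, not part of Dask; it assumes an out/ directory exists):

```python
import dask
import dask.dataframe as dd
import pandas as pd

TARGET_BYTES = 100 * 1024 ** 2          # desired output chunk size (illustrative)

@dask.delayed
def add_partition(state, part):
    """Carry partitions along until the buffer is large enough, then write it out."""
    buffered, nbytes, chunk_id = state
    buffered = buffered + [part]
    nbytes += int(part.memory_usage(deep=True).sum())
    if nbytes >= TARGET_BYTES:
        pd.concat(buffered).to_parquet(f"out/chunk-{chunk_id}.parquet")
        return ([], 0, chunk_id + 1)
    return (buffered, nbytes, chunk_id)

@dask.delayed
def flush(state):
    """Write out whatever remains in the buffer at the end."""
    buffered, _, chunk_id = state
    if buffered:
        pd.concat(buffered).to_parquet(f"out/chunk-{chunk_id}.parquet")

df = dd.read_csv("input/*.csv")         # stands in for the processed dataframe
parts = df.to_delayed()                 # one delayed pandas frame per partition

state = ([], 0, 0)                      # (buffered parts, buffered bytes, next chunk id)
for part in parts:                      # chaining keeps partitions in order
    state = add_partition(state, part)
flush(state).compute()
```

Because each step depends on the previous one, the merging runs serially, so it trades speed for preserved partition order.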

MRocklin
  • Just to follow up on this one: I ended up converting the dataframe to delayed and constructing a chain of delayed functions that merge partitions (by passing them along) into big enough chunks to be written/uploaded. This is a bit slow, but it preserves partition order and works for me because I only need this repartitioning for the purpose of storage. – evilkonrex Sep 27 '17 at 11:23