I have a dataset in which a few elements cause far more computation than the rest, because the work done on them grows quadratically. These elements tend to be close to each other, so they generally end up in the same partition. I want to randomly reshuffle the elements so that the workload is spread more or less evenly across partitions, instead of all the expensive computation landing in a single partition.
Right now I'm pulling everything down to the coordinator with a piece of code like this:
import dask.bag as db
import random
bag = ...
l = bag.compute()           # gather every element into the coordinator process
random.shuffle(l)           # shuffle locally, in memory
bag = db.from_sequence(l)   # scatter the shuffled list back out as a new bag
Is there a more distributed way to do this? I tried, for example, repartitioning based on a random key, roughly as sketched below, but I end up with most partitions empty.
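The random-key attempt looked something like this (a rough sketch, not the exact code I ran; nparts and the groupby-then-flatten pattern are just how I approximated "repartition by a random key"):
import random
import dask.bag as db
# bag is the same bag as above; nparts is the target number of partitions
nparts = bag.npartitions
shuffled = (
    bag.groupby(lambda x: random.randrange(nparts), npartitions=nparts)  # random key per element
       .map(lambda kv: kv[1])   # drop the key, keep the grouped elements
       .flatten()               # back to a flat bag of elements
)
With this, most of the resulting partitions come out empty while a few hold several groups, which is the behaviour I'd like to avoid.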