Currently I have a problem where certain partitions need to do many more operations locally than the others. This leads to unbalanced work across the cluster: some workers sit idle because they finish their tasks much earlier. So I decided to filter those heavy partitions out into a new RDD and repartition that RDD in order to distribute the work evenly across the other workers.
However, when I print out the partitions after repartitioning, I find that most or almost all of them are still on one worker. This is not at all what I expected from the `repartition` operation, since it causes a full shuffle.
Here is an example of my problem:
The data is `[(0,0), (1,1), (3,3), ..., (99,99)]`. There are 5 slaves and 1 master. After some operations, one slave holds a partition that needs to do far more work than the others; this partition contains something like `[(5,5), (6,6), ..., (80,80)]` (let's say it's on worker 2). I then filter this partition out into a new RDD and use `flatMap` to split the data up.
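For reference, the filter-and-split step looks roughly like this (a minimal PySpark sketch; `is_heavy` and `split_pair` are hypothetical stand-ins for my actual predicate and splitting logic):

```python
from pyspark import SparkContext

sc = SparkContext(appName="skew-example")

# Toy data matching the example above.
rdd = sc.parallelize([(i, i) for i in range(100)])

# Hypothetical predicate: marks the keys whose processing is expensive.
def is_heavy(pair):
    return 5 <= pair[0] <= 80

heavy = rdd.filter(is_heavy)

# Stand-in for my real splitting logic; here it just breaks each
# record's work into a few smaller pieces.
def split_pair(pair):
    k, v = pair
    return [((k, chunk), v) for chunk in range(4)]

heavy_split = heavy.flatMap(split_pair)
```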
After that, I use the `repartition` operation to split this RDD into smaller partitions, which I expect to land on the other workers as well (I tried `partitionBy` too). However, when I print the data in each partition of this RDD, all or most of the partitions sit on a single worker (let's say worker 3). So even though the data moved from worker 2 to worker 3, it is still not evenly distributed across the workers.
How can I distribute these partitions evenly across the workers? Are there other approaches to this problem? I've been stuck on it for a while.
Thanks