
In Spark, custom Partitioners can be supplied for RDDs. Normally, the resulting partitions are distributed across the set of workers; for example, with 20 partitions and 4 workers, each worker gets (approximately) 5 partitions. However, the placement of partitions on workers (nodes) appears to be random, as in the table below.

          trial 1    trial 2
worker 1: [10-14]    [15-19]
worker 2: [5-9]      [5-9]  
worker 3: [0-4]      [10-14]
worker 4: [15-19]    [0-4]  

This is fine for operations on a single RDD, but with join() or cogroup() operations that span multiple RDDs, the communication between those nodes becomes a bottleneck. I would use the same partitioner for multiple RDDs and want to be sure they will end up on the same node so the subsequent join() would not be costly. Is it possible to control the placement of partitions on workers (nodes)? (A minimal sketch of how I observe the placement follows the table below.)

          desired
worker 1: [0-4]
worker 2: [5-9]
worker 3: [10-14]
worker 4: [15-19]
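
For reference, here is a minimal sketch of the kind of probe I use to see the placement (the `IdentityPartitioner` and the hostname probe are illustrative, not my actual job):

```scala
import java.net.InetAddress
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Illustrative partitioner: key k (0-19) always lands in partition k, so the
// partition *contents* are deterministic; only their placement on workers varies.
class IdentityPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key.asInstanceOf[Int] % numPartitions
}

object PlacementProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("placement-probe"))

    val rdd = sc.parallelize((0 until 20).map(k => (k, k)))
      .partitionBy(new IdentityPartitioner(20))

    // Report which host executed each partition; rerunning the job shows the
    // partition-to-worker assignment changing between trials, as in the tables above.
    rdd.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, InetAddress.getLocalHost.getHostName, it.size))
    }.collect().foreach(println)

    sc.stop()
  }
}
```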
bekce
  • ` I would use the same partitioner for multiple RDDs and want to be sure they will end up on the same node so the subsequent join() would not be costly`: this is the right way to handle it. You cannot explicitly specify the worker node to be used for each partition, since that would break the abstractions of parallel computation defined by Spark. – rogue-one Jun 07 '17 at 13:45

1 Answer


I would use the same partitioner for multiple RDDs and want to be sure they will end up on the same node so the subsequent join() would not be costly.

This is the right way to handle joins between RDDs: it ensures that the records to be joined end up in the same partition/executor.
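
As a minimal sketch (keys, sizes, and names here are made up for illustration), partition both sides with one shared partitioner and cache them before the join:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("co-partitioned-join"))

    val partitioner = new HashPartitioner(20)

    // Partition both RDDs with the *same* partitioner and cache them, so the
    // shuffle happens once up front instead of inside every subsequent join.
    val left = sc.parallelize((0 until 100).map(k => (k, s"left-$k")))
      .partitionBy(partitioner).cache()
    val right = sc.parallelize((0 until 100).map(k => (k, s"right-$k")))
      .partitionBy(partitioner).cache()

    // Because both sides share a partitioner, the join has narrow dependencies:
    // matching keys are already in the same partition, so no extra shuffle is needed.
    val joined = left.join(right)
    println(joined.partitioner) // Some(org.apache.spark.HashPartitioner@...)

    sc.stop()
  }
}
```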

Is it possible to control the placement of partitions on workers (nodes)?

It is not possible to explicitly specify the worker node for each partition. This would break the abstractions of parallel computation defined by Spark and by other parallel computation frameworks such as MapReduce or Tez.

Spark and other parallel computation frameworks are designed to be fault tolerant. This means that if a small subset of worker nodes fails, the failed nodes are replaced with other worker nodes, and this process occurs transparently to the user application.

These abstractions would break if a user were allowed to refer explicitly to a worker node in the application. The only means of governing how the records of an RDD are placed into partitions is by supplying your own Partitioner for the RDD.

rogue-one
  • You are absolutely right, I am handling it that way. However, I have tried it and the partition assignments of each RDD are **different** from the other, even though they use the very same partitioner. Since the data ends up on different (random) nodes, the join() becomes terribly suboptimal, as the second RDD needs to be brought in from all around the cluster. – bekce Jun 07 '17 at 14:10
  • Can you share the relevant part of your code (such as the type of partitioner being used) so it can be investigated further? – rogue-one Jun 07 '17 at 14:11
  • I have just put the example code in https://gist.github.com/bekce/e14afdc30814e9d4712d0df6ac967cf0 – bekce Jun 07 '17 at 14:24
  • You could be interested in @Jacek Laskowski's answer to this question: https://stackoverflow.com/questions/47799726/how-to-set-preferred-locations-for-rdd-partitions-manually . Basically, there is a `makeRDD()` function, available only in the Scala API, that can be used to suggest (but only suggest) the location of each of the RDD's objects; a minimal sketch appears after these comments. – frb Feb 25 '19 at 12:27
  • What might be the reasons Spark would assign more partitions to one node and none to others, even if all of them are healthy? – kboom May 26 '19 at 09:09
  • @kboom I am facing a similar issue (https://stackoverflow.com/q/73096143/3892213); did you find an answer to your question? – best wishes Aug 04 '22 at 09:11
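
For completeness, a minimal sketch of the `makeRDD()` overload mentioned in the comment above (hostnames are hypothetical, and the preferred locations are only hints to the scheduler, not guarantees):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preferred-locations"))

    // Each element is paired with the hostnames it should preferably be placed on.
    // This overload of makeRDD creates one partition per element.
    val data = Seq(
      ("keys 0-4",   Seq("worker-1")),
      ("keys 5-9",   Seq("worker-2")),
      ("keys 10-14", Seq("worker-3")),
      ("keys 15-19", Seq("worker-4"))
    )
    val rdd = sc.makeRDD(data)

    // Inspect the preference recorded for each partition (still only a suggestion
    // to the scheduler, not a placement guarantee).
    rdd.partitions.foreach(p => println(s"partition ${p.index}: ${rdd.preferredLocations(p)}"))

    sc.stop()
  }
}
```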