
In Spark I have two PairRDDs (let us call them A and B) consisting of n partitions each. I want to join those RDDs based upon their keys.

Both RDDs are consistently partitioned, i.e., if keys x and y are in the same partition in RDD A, they are also in the same partition in RDD B. For RDD A, I can assure that the partitioning is done using a particular Partitioner. But for RDD B, the partition indices may be different than those from RDD A (RDD B is the output of some legacy library that I am reluctant to touch if not absolutely necessary).

I would like to efficiently join RDD A and B without performing a shuffle. In theory this would be easy if I could reassign the partition numbers of RDD B such that they match those in RDD A.

My question now is: Is it possible to edit the partition numbers of an RDD (basically permuting them)? Or alternatively can one assign a partitioner without causing a shuffle operation? Or do you see another way for solving this task that I am currently too blind to see?

Philosophus42
  • If you have control over the partitioner used for RDD A why not simply use the same partitioner as for RDD B? – zero323 Aug 31 '15 at 18:22
  • The original RDD B I have at hand is not partitioned by that partitioner, and I would like to avoid the shuffle that results from a call to partitionBy (which increases the job's runtime by a factor of 15). – Philosophus42 Sep 01 '15 at 08:01

1 Answer


Yes, you can change the partitioning, but to actually avoid shuffling, the data must already be co-located on the same cluster nodes.

  1. Control the partitioning at the data source level and/or with the partitionBy operator.
  2. If the smaller RDD fits in the memory of every worker, a broadcast variable is the faster option.
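For option 2, a minimal self-contained sketch of a broadcast (map-side) join, assuming Spark is on the classpath; the RDD contents and the names rddA/rddB/joined are illustrative stand-ins for the A and B from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bjoin").setMaster("local[*]"))

val rddA = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))   // the large RDD
val rddB = sc.parallelize(Seq(("a", 10), ("c", 30)))           // the small RDD

// Ship all of B to every executor once as a broadcast variable, then
// join inside A's existing partitions -- no shuffle of either RDD.
val bcast = sc.broadcast(rddB.collectAsMap())                  // B must fit in memory
val joined = rddA.mapPartitions { iter =>
  val lookup = bcast.value
  iter.flatMap { case (k, v) => lookup.get(k).map(w => (k, (v, w))) }
}
// joined contains ("a", (1, 10)) and ("c", (3, 30)); "b" has no match in B
```

Because the lookup happens per partition of A, A's partitioning is preserved and no data from A ever moves.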

Since, as you mentioned, the partitioning is already consistent, you do not need to repartition (or edit the existing partition numbers).

Keep in mind that guaranteeing data co-location is hard to achieve.
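If the broadcast route is not feasible, the "permute the partition numbers" idea from the question can at least be computed cheaply: since all keys of a given B partition land in the same A partition, one sample key per B partition determines the whole permutation. A pure-Scala sketch of that index calculation, mirroring the formula Spark's HashPartitioner uses (the helper names here are illustrative, not Spark API):

```scala
// Partition index a HashPartitioner with `numPartitions` partitions would
// assign to `key`: non-negative key.hashCode modulo numPartitions.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

// Given one sample key per partition of B (index i holds a key drawn from
// B's partition i), permutation(i) is the index that A's partitioner
// would assign to all keys of that partition.
def permutation(sampleKeys: IndexedSeq[Any], numPartitions: Int): IndexedSeq[Int] =
  sampleKeys.map(k => hashPartition(k, numPartitions))
```

Actually applying the permutation without a shuffle still requires, as far as I know, a custom RDD subclass that overrides getPartitions and partitioner, since Spark's public API offers no way to renumber partitions in place.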

Aakash Aggarwal