In Spark I have two PairRDDs (let us call them A and B) consisting of n partitions each. I want to join those RDDs based upon their keys.
Both RDDs are consistently partitioned, i.e., if keys x and y are in the same partition in RDD A, they are also in the same partition in RDD B. For RDD A, I can ensure that the partitioning is done using a particular Partitioner. For RDD B, however, the partition indices may differ from those in RDD A (RDD B is the output of a legacy library that I am reluctant to touch unless absolutely necessary).
I would like to efficiently join RDD A and B without performing a shuffle. In theory this would be easy if I could reassign the partition numbers of RDD B such that they match those in RDD A.
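To make the setup concrete, here is a toy model in plain Python rather than Spark. It treats an RDD as a list of partitions, each a list of (key, value) pairs, and all function and variable names are made up for illustration. It only shows what I mean by "consistently partitioned" and by "permuting the partition indices". It assumes every partition of A shares at least one key with some partition of B, and it says nothing about how to achieve this in Spark itself.

```python
# Toy model only: an "RDD" is a list of partitions, each a list of (key, value) pairs.
# All names are hypothetical; nothing here uses the Spark API.

def partition_of_key(partitions):
    """Map each key to the index of the partition that holds it."""
    return {key: idx for idx, part in enumerate(partitions) for key, _ in part}


def index_permutation(a_parts, b_parts):
    """Return a dict mapping each partition index of A to the matching index of B.

    Raises ValueError if the two layouts are not consistently partitioned,
    i.e. if the index mapping is not one-to-one on the shared keys.
    """
    a_index = partition_of_key(a_parts)
    a_to_b, b_to_a = {}, {}
    for b_idx, part in enumerate(b_parts):
        for key, _ in part:
            if key not in a_index:
                continue  # key exists only in B, so it tells us nothing
            a_idx = a_index[key]
            if (a_to_b.setdefault(a_idx, b_idx) != b_idx
                    or b_to_a.setdefault(b_idx, a_idx) != a_idx):
                raise ValueError("layouts are not consistently partitioned")
    return a_to_b


def local_join(a_parts, b_parts):
    """Inner-join partition i of A with its matching partition of B.

    No pair ever leaves its partition, which is the shuffle-free behaviour I am after.
    """
    a_to_b = index_permutation(a_parts, b_parts)
    joined = []
    for a_idx, a_part in enumerate(a_parts):
        b_part = b_parts[a_to_b[a_idx]] if a_idx in a_to_b else []
        b_lookup = {}
        for key, value in b_part:
            b_lookup.setdefault(key, []).append(value)
        joined.append([(k, (va, vb)) for k, va in a_part for vb in b_lookup.get(k, [])])
    return joined


# Same key groups in both layouts, but B numbers its partitions differently.
a_parts = [[(1, "a1"), (4, "a4")], [(2, "a2")], [(3, "a3"), (6, "a6")]]
b_parts = [[(3, "b3"), (6, "b6")], [(1, "b1"), (4, "b4")], [(2, "b2")]]

print(index_permutation(a_parts, b_parts))
print(local_join(a_parts, b_parts))
```

In this model the join is trivial once the index mapping is known; what I am missing is the Spark-side equivalent of applying that mapping to RDD B.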
My question is: is it possible to change the partition numbers of an RDD (basically permuting them)? Alternatively, can one assign a partitioner to an RDD without causing a shuffle? Or is there another way to solve this that I am overlooking?