
I am grouping an RDD based on a key.

rdd.groupBy(_.key).partitioner
=> Some(org.apache.spark.HashPartitioner@a)

I see that by default Spark associates a HashPartitioner with this RDD, which is fine by me, because I agree that we need some kind of partitioner to bring alike data to one executor. But later in the program I want the RDD to forget its partitioning strategy, because I want to join it with another RDD that follows a different one. How can we remove the partitioner from an RDD?
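For context, here is a minimal, self-contained sketch of what I am doing; the Record case class and the sample data are hypothetical stand-ins for my real schema:

import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for my real records: anything with a key field.
case class Record(key: String, value: Int)

val sc = new SparkContext(
  new SparkConf().setAppName("partitioner-demo").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(Record("a", 1), Record("b", 2), Record("a", 3)))

// groupBy shuffles by key, so the result carries a HashPartitioner.
val grouped = rdd.groupBy(_.key)
grouped.partitioner // => Some(org.apache.spark.HashPartitioner@a)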

shashwat
  • A simple map operation will do the job (see the first sketch after these comments). I don't think it's required, though: when you perform a join on a paired RDD, you can specify a partitioner, which will repartition the RDD again. – banjara May 05 '16 at 14:22
  • Thank you @shekhar, I wrote rdd.map(e => e) and it worked. The real problem is actually http://stackoverflow.com/questions/37051718/apache-spark-join-two-rdds-with-different-partitioners ; I thought I could make it work by making one of the RDDs forget its partitioner. Do you happen to know another way around it? – shashwat May 05 '16 at 15:52
  • Can't you broadcast the 2nd RDD and then do an in-memory join by iterating over the 1st RDD (see the second sketch below)? – banjara May 05 '16 at 16:59
  • No, even the 2nd RDD is too big (~10GB) for broadcasting, and I have 15 other such RDDs which have to be joined with rdd1, which is ~100GB. – shashwat May 06 '16 at 05:34
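A minimal sketch of the map trick from the first comment, continuing from the snippet in the question. The identity map is enough, because map() does not promise Spark that keys stay on the same partitions:

// A plain map (even the identity) does not set preservesPartitioning,
// so the resulting RDD's partitioner is None.
val forgetful = grouped.map(identity)
forgetful.partitioner // => None

// The opposite is also possible: mapPartitions can opt in to keeping it.
val kept = grouped.mapPartitions(iter => iter, preservesPartitioning = true)
kept.partitioner // => Some(org.apache.spark.HashPartitioner@a)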
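And a sketch of the broadcast join banjara suggests, assuming (unlike the situation in the last comment) that the second RDD fits in memory; the names big and small are hypothetical:

import org.apache.spark.rdd.RDD

// Hypothetical pair RDDs; small must fit in driver and executor memory.
val big: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))
val small: RDD[(String, String)] = sc.parallelize(Seq(("a", "x")))

// Ship the small side to every executor once, instead of shuffling big.
val smallMap = sc.broadcast(small.collectAsMap())

// Map-side join: each partition of big does local hash lookups, no shuffle.
val joined = big.flatMap { case (k, v) =>
  smallMap.value.get(k).map(w => (k, (v, w)))
}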

0 Answers