1

I have two different RDDs and want to zipPartitions them. Before that, I would like partitions with the same index in the two RDDs to be persisted on the same executor, so that no shuffle is needed when calling zipPartitions.

greatji

1 Answer

1

You will have to wrap your RDD inside a new RDD that overrides this method:

def getPreferredLocations(split: Partition): Seq[String]

This method tells the scheduler the preferred locations where a given partition should be computed.
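As a minimal sketch, such a wrapper could look like the following. The class name `PinnedLocationRDD` and the `hostFor` mapping (partition index to preferred hostnames) are hypothetical; only `getPreferredLocations`, `getPartitions`, and `compute` are part of the real `RDD` contract:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical host assignment: partition i prefers hosts(i % hosts.size).
val hosts = Vector("host-a", "host-b")
val hostFor: Int => Seq[String] = i => Seq(hosts(i % hosts.size))

// Wrapper RDD that keeps the parent's partitions and data untouched,
// but reports a preferred location for each partition to the scheduler.
class PinnedLocationRDD[T: ClassTag](prev: RDD[T], hostFor: Int => Seq[String])
    extends RDD[T](prev) {

  // Reuse the parent's partitioning as-is.
  override def getPartitions: Array[Partition] = prev.partitions

  // Delegate computation to the parent partition with the same index.
  override def compute(split: Partition, ctx: TaskContext): Iterator[T] =
    prev.iterator(split, ctx)

  // Preferred hosts for this partition; the scheduler treats this as a
  // hint, not a hard placement guarantee.
  override def getPreferredLocations(split: Partition): Seq[String] =
    hostFor(split.index)
}
```

Wrapping both RDDs with the same `hostFor` function makes partitions with equal indices prefer the same host, so a subsequent `zipPartitions` can run locality-aware. Note that preferred locations are only a hint; Spark may still schedule elsewhere if the preferred executor is busy.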

[I faced a similar problem while doing a hash join on two RDDs and blogged about it. You might want to have a look here.]

Sachin Tyagi
  • Thanks, but it seems that we need to know the IP of each machine ahead of time. Is there a method to specify the executor id for each partition? – greatji Sep 29 '16 at 07:40
  • Right, there's a method in SparkContext to get the executor IDs, but it's Spark local. So I don't really have a way other than this bit of a hack. – Sachin Tyagi Sep 29 '16 at 09:23
  • @SachinTyagi what about https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkContext.html#makeRDD(scala.collection.Seq,%20scala.reflect.ClassTag)? Does it provide a 100% guarantee that a partition will be placed on a specific node? – VB_ Mar 03 '17 at 00:22
  • @SachinTyagi could you please look at http://stackoverflow.com/questions/42568569/enforce-partition-be-stored-on-the-specific-executor? – VB_ Mar 03 '17 at 00:35