I have two RDDs that I want to zipPartitions together. Before doing so, I would like partitions with the same index in both RDDs to be persisted on the same executor, so that the zipPartitions call involves no shuffle.
1 Answer
You will have to wrap your RDD in a new RDD that implements this method:
def getPreferredLocations(split: Partition): Seq[String]
This method tells the scheduler the preferred locations (hosts) on which a given partition should be computed.
[I faced a similar concern while implementing a hash join on two RDDs and blogged about it. You might want to have a look here.]
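A minimal sketch of such a wrapper is below. The `hostForPartition` mapping (partition index to preferred hostnames) is a hypothetical function you would supply yourself; everything else uses the standard `RDD` API. Note that preferred locations are hints to the scheduler, not guarantees.

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Wraps an existing RDD and attaches a preferred host to each partition.
// `hostForPartition` is assumed to map a partition index to hostnames.
class LocationAwareRDD[T: ClassTag](
    prev: RDD[T],
    hostForPartition: Int => Seq[String])
  extends RDD[T](prev) {

  // Reuse the parent's partitioning unchanged.
  override def getPartitions: Array[Partition] = prev.partitions

  // Delegate computation to the parent RDD.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)

  // Tell the scheduler where each partition should preferably run.
  override def getPreferredLocations(split: Partition): Seq[String] =
    hostForPartition(split.index)
}
```

If you wrap both RDDs with the same index-to-host mapping, partition i of each should be scheduled on the same executor, so the subsequent zipPartitions can proceed without a shuffle (assuming the two RDDs have the same number of partitions).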

Sachin Tyagi
- Thanks, but it seems we need to know the IP of each machine ahead of time. Is there a way to specify the executor ID for each partition? – greatji Sep 29 '16 at 07:40
- Right, there is a method in SparkContext to get the executor IDs, but it is private to Spark. So I don't really have a way other than this bit of a hack. – Sachin Tyagi Sep 29 '16 at 09:23
- @SachinTyagi how about https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkContext.html#makeRDD(scala.collection.Seq,%20scala.reflect.ClassTag)? Does it guarantee that a partition will be placed on a specific node? – VB_ Mar 03 '17 at 00:22
- @SachinTyagi could you please take a look at http://stackoverflow.com/questions/42568569/enforce-partition-be-stored-on-the-specific-executor? – VB_ Mar 03 '17 at 00:35