1

I have two different RDDs and want to zipPartitions them. Before that, I would like partitions with the same index in the two RDDs to be persisted on the same executor, so that no shuffle is needed when calling zipPartitions.

greatji

1 Answer

1

You will have to wrap your RDD inside a new RDD that overrides this method:

def getPreferredLocations(split: Partition): Seq[String]

This method tells the scheduler the preferred locations where a given partition should be computed.
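As a minimal sketch, such a wrapper could look like the following. The class name `PinnedLocationRDD` and the `hostFor` mapping (partition index to preferred hostnames) are hypothetical; only `getPreferredLocations`, `getPartitions`, and `compute` are part of the real `RDD` contract:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical host assignment: partition i prefers hosts(i % hosts.size).
val hosts = Vector("host-a", "host-b")
val hostFor: Int => Seq[String] = i => Seq(hosts(i % hosts.size))

// Wrapper RDD that keeps the parent's partitions and data untouched,
// but reports a preferred location for each partition to the scheduler.
class PinnedLocationRDD[T: ClassTag](prev: RDD[T], hostFor: Int => Seq[String])
    extends RDD[T](prev) {

  // Reuse the parent's partitioning as-is.
  override def getPartitions: Array[Partition] = prev.partitions

  // Delegate computation to the parent partition with the same index.
  override def compute(split: Partition, ctx: TaskContext): Iterator[T] =
    prev.iterator(split, ctx)

  // Preferred hosts for this partition; the scheduler treats this as a
  // hint, not a hard placement guarantee.
  override def getPreferredLocations(split: Partition): Seq[String] =
    hostFor(split.index)
}
```

Wrapping both RDDs with the same `hostFor` function makes partitions with equal indices prefer the same host, so a subsequent `zipPartitions` can run locality-aware. Note that preferred locations are only a hint; Spark may still schedule elsewhere if the preferred executor is busy.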

[I faced a similar problem while doing a hash join on two RDDs and blogged about it. You might want to have a look here.]

Sachin Tyagi
  • Thanks, but it seems that we need to know the IP of each machine ahead of time. Is there a method to specify the executor id for each partition? – greatji Sep 29 '16 at 07:40
  • Right, there's a method in SparkContext to get the executor IDs, but it's Spark local. So I don't really have a way other than this bit of a hack. – Sachin Tyagi Sep 29 '16 at 09:23
  • @SachinTyagi what about https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkContext.html#makeRDD(scala.collection.Seq,%20scala.reflect.ClassTag)? Does it provide a 100% guarantee that a partition will be placed on a specific node? – VB_ Mar 03 '17 at 00:22
  • @SachinTyagi could you please look at http://stackoverflow.com/questions/42568569/enforce-partition-be-stored-on-the-specific-executor? – VB_ Mar 03 '17 at 00:35