
I have a key-value RDD and I need to join several keysets with this RDD.

The key-value RDD is big (~100 GB); the keysets are relatively small (but not small enough to broadcast them).

I assign the same partitioner to all RDDs and call join.

Expected behavior: after repartitioning, all the data to be joined is co-located and the join is fast. If the keys RDD is small it should be blazing fast.

Actual behavior: the join takes significant time (~10 minutes) even when the keys RDD is small or empty.

// declare a common partitioner for all RDDs
val partitioner = new HashPartitioner(500)

// declare the key-value RDD, partitioned and cached up front
val storage: RDD[(K, V)] = {
  val storage0: RDD[(K, V)] = ???
  storage0.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
}
storage.count() // materialize the cache

// join several RDDs with the storage
(1 to 1000).foreach { i =>
  val keys: RDD[K] = ???
  val partitionedKeys = keys.map(k => k -> ()).partitionBy(partitioner)
  // join the keys RDD with the storage, do something with the result
  partitionedKeys.join(storage).foreachPartition { iter =>
    ???
  }
}
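
For context, my understanding of what HashPartitioner guarantees: a deterministic partition *number* per key. Below is a plain-Scala sketch (outside Spark) of that computation; `nonNegativeMod` mirrors the Spark utility of the same name, reimplemented here as an assumption about its behavior:

```scala
object HashPartitionerDemo {
  // Spark's HashPartitioner assigns a key to nonNegativeMod(key.hashCode, numPartitions).
  // Reimplemented here, without Spark, to illustrate the determinism.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def getPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val numPartitions = 500
    // The same key always maps to the same partition number...
    val p1 = getPartition("someKey", numPartitions)
    val p2 = getPartition("someKey", numPartitions)
    assert(p1 == p2)
    assert(p1 >= 0 && p1 < numPartitions)
    // ...which is why two RDDs sharing the partitioner can join without a shuffle,
    // but it says nothing about which executor ends up hosting that partition.
    println(s"someKey -> partition $p1")
  }
}
```

So both RDDs agree on which partition each key belongs to; my assumption was that this also implies the partitions live on the same executors.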

What do I get wrong here?

simpadjo
  • Your assumption is incorrect – co-partitioned data is guaranteed to be co-located only if the shuffle is performed during the same action. Since you execute multiple actions with different RDDs, the data will be co-located only for the first one. For the subsequent actions the only guarantee you get is a one-to-one dependency: as a result there is no need to shuffle, but data might still be transferred over the network. It is also possible that you are experiencing unrelated problems, including data skew, long GC pauses, and expensive cache dumps (memory -> disk). – zero323 Jan 03 '18 at 15:00
  • @user6910411 thank you for your answer. GC and disk are definitely not an issue. How can I achieve data co-location in my case? – simpadjo Jan 03 '18 at 15:06
  • You can guarantee that the data will have the same partition number, but not that partitions with the same number will be assigned to the same location. Maybe you can have a look at "preferred locations", see http://codingcat.me/2016/02/29/how-spark-decides-preferredlocation-for-a-task/ and https://stackoverflow.com/questions/47799726/how-to-set-preferred-locations-for-rdd-partitions-manually – Marie Jan 03 '18 at 15:48
  • Where does your data come from, and what objects are your keys? Maybe there is another solution than repartitioning. – Marie Jan 03 '18 at 15:52

0 Answers