I have a key-value RDD and I need to join several keysets with this RDD.
The key-value RDD is big (100 GB); the keysets are relatively small, but not small enough to broadcast.
I assign the same partitioner to all RDDs and call join.
Expected behavior: after repartitioning, all the data to be joined is colocated, so the join should be fast. If the keys RDD is small, it should be blazing fast.
Actual behavior: the join takes significant time (~10 minutes) even if the keys RDD is small or empty.
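To spell out the expectation, here is a minimal plain-Scala sketch of my mental model. It assumes that a hash partitioner picks a partition as the non-negative modulo of the key's hashCode by the partition count, so the same key gets the same partition index on both sides. The object name, helper name, and toy keys are made up for illustration; no Spark is involved.

```scala
// Sketch of the partition-index rule assumed for a hash partitioner:
// non-negative modulo of the key's hashCode by the number of partitions.
object CoPartitioningModel {
  def partitionIndex(key: Any, numPartitions: Int): Int = key match {
    case null => 0 // assumption: null keys go to partition 0
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 500
    val storageKeys = Seq("alpha", "beta", "gamma") // toy stand-ins for storage keys
    val lookupKeys = Seq("beta", "gamma")            // toy stand-ins for one keyset

    // Each lookup key maps to the same index as the matching storage key,
    // which is why I expect no data movement for the big side.
    lookupKeys.foreach { k =>
      val idx = partitionIndex(k, numPartitions)
      assert(storageKeys.filter(_ == k).forall(partitionIndex(_, numPartitions) == idx))
      println(s"$k -> partition $idx")
    }
  }
}
```

If this model is right, the only work left for the join should be matching keys inside each partition.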
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // declare a common partitioner for all RDDs
    val partitioner = new HashPartitioner(500)

    // declare the key-value RDD, partition it once and cache it
    val storage: RDD[(K, V)] = {
      val storage0: RDD[(K, V)] = ???
      storage0.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
    }
    // force materialization of the cached storage
    storage.count()

    // join several RDDs with the storage
    (1 to 1000).foreach { i =>
      val keys: RDD[K] = ???
      val partitionedKeys = keys.map(k => k -> ()).partitionBy(partitioner)
      // join the keys RDD with the storage, do something with the result
      partitionedKeys.join(storage).foreachPartition { iter =>
        ???
      }
    }
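For reference, this is a sketch of a check that could rule out a partitioner mismatch. It uses toy data in local mode, not my real job. The object name, the `samePartitioner` helper, and the sample keys are made up for illustration; `partitioner` and `toDebugString` are the standard RDD members. It prints whether both sides report the same partitioner, then the lineage of the join so the shuffle stages are visible.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerCheck {
  // Hypothetical helper: true only if both RDDs carry an equal, defined partitioner.
  def samePartitioner(a: RDD[_], b: RDD[_]): Boolean =
    a.partitioner.isDefined && a.partitioner == b.partitioner

  def main(args: Array[String]): Unit = {
    // Local mode and toy data are assumptions made only for this check.
    val conf = new SparkConf().setAppName("partitioner-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val partitioner = new HashPartitioner(8)

      // Small stand-in for the big storage RDD, partitioned once and cached.
      val storage = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
        .partitionBy(partitioner)
        .persist()
      storage.count()

      // Small stand-in for one keyset, partitioned the same way.
      val partitionedKeys = sc.parallelize(Seq("b", "c"))
        .map(k => k -> ())
        .partitionBy(partitioner)

      val joined = partitionedKeys.join(storage)

      println(s"same partitioner: ${samePartitioner(partitionedKeys, storage)}")
      // The lineage shows which shuffles the join depends on.
      println(joined.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

In my real job I would run the same two checks against `partitionedKeys` and `storage` inside the loop.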
What am I getting wrong here?