I have an RDD of maps, where the maps are not certain to have intersecting key sets. Each map may have tens of thousands of entries.
I need to merge the maps, such that maps with intersecting key sets are merged into one, while the others are left distinct.
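For example (purely illustrative data): the first two maps below share key 2 and should collapse into one map, keeping the minimum value on collisions (that's what my merge below does), while the third stays separate:

val input = Seq(
  Map(1 -> 5, 2 -> 7),
  Map(2 -> 3, 4 -> 1), // shares key 2 with the first map
  Map(9 -> 9)          // shares no keys with the others
)
// expected result: two maps
//   Map(1 -> 5, 2 -> 3, 4 -> 1)   (min kept where keys collide)
//   Map(9 -> 9)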
Here's what I have. I haven't tested that it works, but I know that it's slow.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def mergeOverlapping(maps: RDD[Map[Int, Int]])(implicit sc: SparkContext): RDD[Map[Int, Int]] = {
  // Wrap each map in a singleton list so the fold can combine lists of pairwise-disjoint maps.
  val in: RDD[List[Map[Int, Int]]] = maps.map(List(_))
  val z = List.empty[Map[Int, Int]]
  val t: List[Map[Int, Int]] = in.fold(z) { case (l, r) =>
    // Re-fold the concatenated lists back into a single list of pairwise-disjoint maps.
    (l ::: r).foldLeft(List.empty[Map[Int, Int]]) { case (acc, next) =>
      // Split the accumulator into maps that share a key with `next` and those that don't.
      val (overlapping, distinct) = acc.partition(_.keys.exists(next.contains))
      overlapping match {
        case Nil => next :: acc                                // no overlap: keep `next` as its own map
        case xs  => (next :: xs).reduceLeft(merge) :: distinct // collapse `next` with everything it touches
      }
    }
  }
  sc.parallelize(t)
}
def merge(l: Map[Int, Int], r: Map[Int, Int]): Map[Int, Int] = {
  val keys = l.keySet ++ r.keySet
  keys.map { k =>
    (l.get(k), r.get(k)) match {
      case (Some(i), Some(j)) => k -> math.min(i, j)   // key in both maps: keep the smaller value
      case (a, b)             => k -> (a orElse b).get // key in exactly one map
    }
  }.toMap
}
The problem, as far as I can tell, is that RDD#fold is merging and re-merging maps many more times than it has to.
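My mental model of fold (simplified, and possibly wrong, which is part of what I'm asking) is roughly the sketch below: the op folds each partition locally from the zero value, and then folds the per-partition results again, so the whole partition-and-merge pass re-runs over maps that have already been merged once:

// Rough single-threaded model of RDD#fold as I understand it
// (a hypothetical helper for illustration, not part of the Spark API):
def foldModel[T](partitions: Seq[Seq[T]], zero: T)(op: (T, T) => T): T =
  partitions
    .map(_.foldLeft(zero)(op)) // stage 1: fold within each partition
    .foldLeft(zero)(op)        // stage 2: fold the partition results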
Is there a more efficient mechanism that I could use? Is there another way I could structure my data to make this merge efficient?