
I have an RDD of maps, and some of those maps are certain to have intersecting key sets. Each map may have tens of thousands of entries.

I need to merge the maps, such that those with intersecting key sets are merged, but others are left distinct.
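
For example (illustrative values), given these three input maps:

Map(1 -> 2, 2 -> 3)   // shares key 2 with the second map
Map(2 -> 5, 4 -> 1)
Map(9 -> 9)           // shares no keys with the others

I'd expect two maps back: the first two merged into one, and Map(9 -> 9) left distinct.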

Here's what I have. I haven't tested that it works, but I know that it's slow.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def mergeOverlapping(maps: RDD[Map[Int, Int]])(implicit sc: SparkContext): RDD[Map[Int, Int]] = {
  // Wrap each map in a singleton list so the fold can carry a running
  // list of mutually-distinct maps.
  val in: RDD[List[Map[Int, Int]]] = maps.map(List(_))
  val z = List.empty[Map[Int, Int]]

  val t: List[Map[Int, Int]] = in.fold(z) { case (l, r) =>
    (l ::: r).foldLeft(List.empty[Map[Int, Int]]) { case (acc, next) =>
      // Split the accumulator into maps that share a key with `next`
      // and maps that don't.
      val (overlapping, distinct) = acc.partition(_.keySet.exists(next.contains))
      overlapping match {
        case Nil => next :: acc
        case xs  => (next :: xs).reduceLeft(merge) :: distinct
      }
    }
  }

  sc.parallelize(t)
}

// Point-wise merge of two maps, keeping the smaller value when a key
// appears in both.
def merge(l: Map[Int, Int], r: Map[Int, Int]): Map[Int, Int] = {
  val keys = l.keySet ++ r.keySet
  keys.map { k =>
    (l.get(k), r.get(k)) match {
      case (Some(i), Some(j)) => k -> math.min(i, j)
      case (a, b)             => k -> (a orElse b).get
    }
  }.toMap
}
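
For example, the min-on-conflict rule behaves like this (illustrative values):

merge(Map(1 -> 10), Map(1 -> 7, 3 -> 4))
// => Map(1 -> 7, 3 -> 4): key 1 keeps the smaller value, key 3 passes through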

The problem, as far as I can tell, is that `RDD#fold` is merging and re-merging maps many more times than it has to: every incoming map is partitioned against the whole accumulated list within each partition, and the per-partition results are then all combined again, sequentially, on the driver.

Is there a more efficient mechanism that I could use? Is there another way I can structure my data to make it efficient?
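
For illustration, one untested direction (assuming the sequential driver-side combine is the bottleneck) would be to swap the fold for Spark's `treeReduce`, which combines partial results pairwise on the executors instead of one-by-one on the driver:

// Untested sketch, not a drop-in answer: same combine logic as above,
// but treeReduce merges partial lists pairwise in a tree, so each list
// is re-merged O(log n) times. Assumes `maps` is non-empty.
def mergeOverlappingTree(maps: RDD[Map[Int, Int]]): List[Map[Int, Int]] =
  maps.map(List(_)).treeReduce { (l, r) =>
    (l ::: r).foldLeft(List.empty[Map[Int, Int]]) { case (acc, next) =>
      val (overlapping, distinct) = acc.partition(_.keySet.exists(next.contains))
      overlapping match {
        case Nil => next :: acc
        case xs  => (next :: xs).reduceLeft(merge) :: distinct
      }
    }
  }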

Synesso
  • How many maps are there? – Karl Bielefeldt Apr 07 '16 at 03:14
  • There are about 500 maps at the end of the process. As input, maybe 5,000 to 10,000. I haven't measured. – Synesso Apr 07 '16 at 03:51
  • Take a look at the `groupWith` function I wrote here: http://stackoverflow.com/a/35919875/21755. That would (with slight modifications) give you an RDD where each entry was a list of Maps with overlapping keys. Then a map over that could merge the maps. – The Archetypal Paul Apr 07 '16 at 08:15

0 Answers