I have an RDD of (key, value) pairs. For each key, I need to fetch the top k values ranked by how often they occur.
I understand that the best way to do this would be to use combineByKey.
Currently, this is what my combineByKey combiners look like:
object TopKCount {
  // TopK Count combiners
  val k: Int = 10

  def createCombiner(value: String): Map[String, Long] = {
    Map(value -> 1L)
  }

  def mergeValue(combined: Map[String, Long], value: String): Map[String, Long] = {
    combined ++ Map(value -> (combined.getOrElse(value, 0L) + 1L))
  }

  def mergeCombiners(combined1: Map[String, Long], combined2: Map[String, Long]): Map[String, Long] = {
    val top10Keys1 = combined1.toList.sortBy(_._2).takeRight(k).toMap.keys
    val top10Keys2 = combined2.toList.sortBy(_._2).takeRight(k).toMap.keys
    (top10Keys1 ++ top10Keys2).map(key => (key, combined1.getOrElse(key, 0L) + combined2.getOrElse(key, 0L)))
      .toList.sortBy(_._2).takeRight(k).toMap
  }
}
I use this as follows:
// input is RDD[(String, String)]
val topKValueCount: RDD[(String, Map[String, Long])] = input.combineByKey(
  TopKCount.createCombiner,
  TopKCount.mergeValue,
  TopKCount.mergeCombiners
)
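For reference, this is roughly how I understand combineByKey to apply the three functions. It is only a Spark-free sketch: `CombineByKeySketch`, `simulate`, `create`, `addValue` and `mergeMaps` are hypothetical names, the nested `Seq` stands in for partitions, and the counting combiners here leave out the top-k truncation.

```scala
object CombineByKeySketch {
  // Simplified local stand-in for combineByKey (an assumption, not Spark's code):
  // each "partition" is folded with createCombiner/mergeValue, then the per-key
  // partial results from all partitions are merged with mergeCombiners.
  def simulate[V, C](
      partitions: Seq[Seq[(String, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[String, C] = {
    // Step 1: combine within each partition.
    val perPartition: Seq[Map[String, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[String, C]) { case (acc, (key, value)) =>
        val combined = acc.get(key) match {
          case Some(c) => mergeValue(c, value)   // key already seen in this partition
          case None    => createCombiner(value)  // first value for this key here
        }
        acc.updated(key, combined)
      }
    }
    // Step 2: merge the partial results across partitions.
    perPartition.foldLeft(Map.empty[String, C]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (key, c)) =>
        merged.updated(key, merged.get(key).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Plain counting combiners (hypothetical), without any top-k truncation.
  val create: String => Map[String, Long] = v => Map(v -> 1L)
  val addValue: (Map[String, Long], String) => Map[String, Long] =
    (m, v) => m.updated(v, m.getOrElse(v, 0L) + 1L)
  val mergeMaps: (Map[String, Long], Map[String, Long]) => Map[String, Long] =
    (a, b) => b.foldLeft(a) { case (m, (v, n)) => m.updated(v, m.getOrElse(v, 0L) + n) }
}
```

The point of the sketch is that the whole per-partition Map for a key is what gets handed to mergeCombiners, which is where my concern about the amount of data being transferred comes from.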
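One optimization to the current code would be to use a min-queue (a priority queue bounded to k entries, with the smallest count at its head) in mergeCombiners instead of sorting the full lists.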
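A minimal sketch of what I have in mind for that optimization, assuming a helper that is not in my current code (`TopKHeap` and `topK` are hypothetical names). It keeps at most k entries in a `scala.collection.mutable.PriorityQueue` whose ordering is reversed so the smallest count sits at the head:

```scala
import scala.collection.mutable

object TopKHeap {
  // Returns the k entries with the largest counts, sorted by descending count.
  // Ties at the cut-off are broken arbitrarily (by the Map's iteration order).
  def topK(counts: Map[String, Long], k: Int): List[(String, Long)] = {
    // PriorityQueue is a max-heap for its ordering; reversing the ordering on
    // the count makes the head the entry with the smallest count.
    val minHeap = mutable.PriorityQueue.empty[(String, Long)](
      Ordering.by[(String, Long), Long](_._2).reverse
    )
    counts.foreach { entry =>
      if (minHeap.size < k) {
        minHeap.enqueue(entry)
      } else if (k > 0 && entry._2 > minHeap.head._2) {
        // The new entry beats the current smallest, so replace it.
        minHeap.dequeue()
        minHeap.enqueue(entry)
      }
    }
    // The heap's iteration order is unspecified, so sort the k survivors.
    minHeap.toList.sortBy(e => -e._2)
  }
}
```

With n entries this does O(n log k) work rather than the O(n log n) of a full sort, which would only matter when the maps are much larger than k.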
I'm more concerned about the network I/O, though. Once the merging within a partition is done, is it possible to send only the top k entries from that partition to the driver, instead of sending the entire Map as I do now?
I would highly appreciate any feedback.