How to merge and aggregate 2 Maps in Scala most efficiently?

Question

I have the following 2 maps:

val map12:Map[(String,String),Double]=Map(("Sam","0203") -> 16216.0, ("Jam","0157") -> 50756.0, ("Pam","0129") -> 3052.0)
val map22:Map[(String,String),Double]=Map(("Jam","0157") -> 16145.0, ("Pam","0129") -> 15258.0, ("Sam","0203") -> -1638.0, ("Dam","0088") -> -8440.0,("Ham","0104") -> 4130.0,("Hari","0268") -> -108.0, ("Om","0169") -> 5486.0, ("Shiv","0181") -> 275.0, ("Brahma","0148") -> 18739.0)

In the first approach I am using foldLeft to achieve the merging and accumulation:

val t1 = System.nanoTime()
// Fold the smaller map into the larger one, summing values for keys present in both.
val merged1 = map12.foldLeft(map22) { case (acc, (k, v)) => acc + (k -> (v + acc.getOrElse(k, 0.0))) }
val t2 = System.nanoTime()
println(" First Time taken :"+ (t2-t1))

In the second approach I am trying to use the aggregate() function, which supports parallel operation:

def merge(map12:Map[(String,String),Double], map22:Map[(String,String),Double]):Map[(String,String),Double]=
  map12 ++ map22.map{case(k, v) => k -> (v + (map12.getOrElse(k, 0.0)))}

val inArr= Array(map12,map22)

val t5 = System.nanoTime()
val mergedNew12 = inArr.par.aggregate(Map[(String,String),Double]())(merge,merge)
val t6 = System.nanoTime()
println(" Second Time taken :"+ (t6-t5))

But I notice the foldLeft is much faster than the aggregate.

I am looking for advice on how to make this operation the most efficient.

Your array has 2 elements, so when you create `.par` array, how many chunks do you think Scala will create for `.aggregate`? I believe the answer is 1. — Victor Moroz, Aug 31 '18 at 19:37
You are right, this coding approach will not be suitable for my problem. In my code, I have 2 Maps on which I am using foldLeft to aggregate and merge the maps. The Maps will have a lot of data in them (many thousands of rows). Performance is a concern, so I am trying to figure out what could be a better approach. — Shiv, Sep 01 '18 at 12:56
Is it a requirement to have immutable `Map`s? Merging one mutable `Map` into another is ~ 5x faster on my machine than `foldLeft` (depends on how many keys you need to copy though). — Victor Moroz, Sep 01 '18 at 17:23
I am thinking that if I convert the foldLeft to a fold, I might be able to get parallelism. — Shiv, Sep 06 '18 at 14:14
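
The mutable-Map suggestion in the comments above can be sketched roughly as follows. This is only an illustration, not code posted in the thread: `MapMerge` and `mergeMutable` are hypothetical names, and copying the larger map and then folding the smaller one into it is an assumption about what is cheapest. Whether it is actually faster depends on the map sizes and should be measured.

```scala
import scala.collection.mutable

object MapMerge {
  // Hypothetical helper (not from the thread): sums the values of shared keys
  // by updating a single mutable accumulator in place.
  def mergeMutable(
      a: Map[(String, String), Double],
      b: Map[(String, String), Double]
  ): mutable.Map[(String, String), Double] = {
    // Copy the larger map once, then add the entries of the smaller one.
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    val acc = mutable.HashMap.empty[(String, String), Double]
    acc ++= large
    small.foreach { case (k, v) =>
      acc.update(k, acc.getOrElse(k, 0.0) + v)
    }
    acc
  }
}
```

Calling `MapMerge.mergeMutable(map12, map22)` returns a mutable map; appending `.toMap` gives an immutable result at the cost of one more copy.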

score 0 · Answer 1 · answered Aug 31 '18 at 19:32

If you want aggregate to run more efficiently with par, try Vector instead of Array; it is one of the best collections for parallel algorithms.

On the other hand, parallel execution has some overhead, so if you do not have enough data it will not pay off.

With the data you gave us, Vector.par.aggregate is faster than Array.par.aggregate, and the sequential Vector.aggregate is faster than foldLeft.

val inVector= Vector(map12,map22)

val t7 = System.nanoTime()
val mergedNew12_2 = inVector.aggregate(Map[(String,String),Double]())(merge,merge)
val t8 = System.nanoTime()
println(" Third Time taken :"+ (t8-t7))

These are my times (in nanoseconds, as returned by System.nanoTime()):

First Time taken : 6431723
Second Time taken : 147474028
Third Time taken : 4855489
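
A single System.nanoTime() measurement on the JVM also includes class loading and JIT compilation, which may be part of why timings like these vary between runs. A rough sketch of a repeated measurement with warm-up is shown below. `TimingSketch` and `medianNanos` are hypothetical names, not part of any library, and a dedicated harness such as JMH is generally more reliable than hand-rolled timing.

```scala
object TimingSketch {
  // Hypothetical helper: evaluates `body` `warmup` times without timing it,
  // then returns the median of `runs` timed evaluations, in nanoseconds.
  def medianNanos[A](warmup: Int, runs: Int)(body: => A): Long = {
    require(runs > 0, "runs must be positive")
    var i = 0
    while (i < warmup) { body; i += 1 } // untimed warm-up evaluations
    val samples = Array.fill(runs) {
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }
    samples.sorted.apply(runs / 2)
  }
}
```

For example, `TimingSketch.medianNanos(warmup = 1000, runs = 1001)(merge(map12, map22))` would time the `merge` function from the question; the iteration counts are arbitrary.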

Sebastian Celestino

First Time taken: 2085636, Second Time taken: 4611369, Third Time taken: 3220944. These are the timings on my machine. I consistently get results where foldLeft is the most efficient in the majority of test runs. I could replicate your pattern of results maybe once in every 10 runs. – Shiv Sep 01 '18 at 12:26
Maybe your collections are too small, causing the overhead of parallel execution to be greater than the benefit? Just a guess. – Seth Tisue Sep 01 '18 at 19:38
