Map an arithmetical operation to a Scala collection and sum the result

Question

Here's some code I shared with my study group yesterday: https://gist.github.com/natemurthy/019e49e6f5f0d1be8719. After compiling, I run map.scala with the following heap params:

$ scala -J"-Xmx4G" map

and get the following results for 4 separate tests:

// (1L to 20000000L).map(_*2)
(Map) multiplying 20 million elements by 2
(Reduce) sum: 400000020000000
Total MapReduce time: 7.562381

// (1L to 20000000L).toArray.map(_*2)
(Map) multiplying 20 million elements by 2
(Reduce) sum: 400000020000000
Total MapReduce time: 1.233997

// (1L to 20000000L).toVector.map(_*2)
(Map) multiplying 20 million elements by 2
(Reduce) sum: 400000020000000
Total MapReduce time: 15.041896 

// (1L to 20000000L).par.map(_*2)
(Map) multiplying 20 million elements by 2
(Reduce) sum: 400000020000000
Total MapReduce time: 18.586220

I'm trying to figure out why these results vary across different collection types, and more importantly, why performance appears to be worse for collections that should intuitively be evaluated faster. Curious to hear your insights into these results. I've also experimented on performing these operations on Breeze and Saddle (which perform much better on the same tests), but I want to see how far I can push the built-in Scala Collections API.

These tests were run on an Asus Zenbook UX31A, Intel Core i7 3517U 1.9 GHz dual core w/hyperthreading, 4 GB RAM, with Ubuntu 12.04 Desktop. Using Scala 2.11.1 with JDK 1.7

Nice answer was given, yes the problem is your benchmarking the *conversion* into various types in addition to the actual operation. Separate the conversion from the operation to find out actual performance. — samthebest, Jul 17 '14 at 17:54

dhg · Accepted Answer · 2014-07-16T18:56:55.257

There's obviously a lot of things going on here, but here are a couple:

First, The to method creates a Range, which is a very efficient data structure in that it does not actually create a collection with 20 million elements. It just knows how to get the next element as you iterate. When you call map on Range, the output is a Vector, so it iterates through the Range (cheap), multiplies each number by 2 (still cheap), but then has to create a Vector (costly; I guess about 7.5 seconds).

Second, When you call .toVector on a Range, it has to actually create a Vector and generate all of those 20 million values. This takes time (again, 7.5 seconds). When you call map, it iterates through the Vector (cheap), multiplies each number by 2 (still cheap), but then has to create a new Vector for the result (costly). So you've performed the same operations, but created two new vectors of 20 million elements this time. (7.5*2=15 seconds.)

Third, Arrays are very simple data structures and have extremely low overhead. They are fast to create, index, and insert, so constructing a big array and then mapping over it to insert elements into a new array is unsurprisingly fast.

Finally, The .par call on Range produces a ParRange. The result of map is a ParVector, so there's a cost in creating that object and putting 20 million elements into it. When you call .map it creates threads to perform the calculation. However, the operation you are mapping over is extremely fast, so there's really no benefit to doing it in parallel. You are spending much more time with the overhead of dealing with parallelization than actually computing the multiplications.

Think of it this way. If you wanted to do this operation in real life, you'd get together a bunch of friends to be your threads. Then you'd have to divide up your 20 million numbers and give each friend a few to do the multiplications. Then your friend would multiply each number by 2 and give, back the doubled numbers, and wait for you to hand out the next set of numbers. Then you'd have to enter each multiplied number into a new table. But task of multiplying a number by 2 is so fast that you could have just done it yourself in less time than it took you to go through the trouble of getting you friends together and passing messages back and forth. Moreover, if you've only got two cores, that's not a room for parallelization anyway, so it's only a couple threads working simultaneously, and your overhead-to-work ratio is not so good.

I wonder if an array plays nicer with CPU cache than a vector. — 9000, Jul 16 '14 at 18:50
I've found that collection views are the way to go: `val dbls = (1L to 20000000L).view.map(_*2); dbls.sum` The map operation is lazily evaluated until sum() is called. Here is the result: `(Map) multiplying 20 million elements by 2; (Reduce) sum: 400000020000000; Total MapReduce time: 0.783316` @dhg — nmurthy, Aug 03 '14 at 06:45

Map an arithmetical operation to a Scala collection and sum the result

1 Answers1