
Running this program produces the results shown below the code:

object ParallelTest {
  def main(args: Array[String]): Unit = {
    // Times building the List *and* copying it into a parallel collection
    val start = System.nanoTime()
    val list = (1 to 10000).toList.par
    println("with par: elapsed: " + (System.nanoTime() - start) / 1000000 + " milliseconds")

    // Times building the plain sequential List only
    val start2 = System.nanoTime()
    val list2 = (1 to 10000).toList
    println("without par: elapsed: " + (System.nanoTime() - start2) / 1000000 + " milliseconds")
  }
}


with par: elapsed: 238 milliseconds 
without par: elapsed: 0 milliseconds

If I understand these results correctly, does using par take longer because "parallelizing" a List requires copying its contents into a parallel data structure?

Kevin Meredith
  • Be careful with such microbenchmarks. There are plenty of side factors that can affect performance, like JIT or garbage collection. – ghik Aug 28 '13 at 17:19
  • But, isn't it expected that simply calling `x.toList.par` would take longer than `x.toList` since `par` involves copying the non-parallel data into a new, parallel data structure? Source - http://docs.scala-lang.org/overviews/parallel-collections/overview.html @ "Creating a Parallel Collection" – Kevin Meredith Aug 28 '13 at 18:09
  • See this: http://docs.scala-lang.org/overviews/parallel-collections/performance.html and e.g. this: http://stackoverflow.com/questions/6642210/dealing-with-the-surprising-lack-of-parlist-in-scala-collections-parallel – axel22 Aug 28 '13 at 18:25
  • Based on these helpful replies and links, the bottom line (if I understand) is that using `par` in this case adds overhead to copy items from a `List` to a parallel collection. However, this overhead is minimal. The 238 ms difference that I saw occurred as a result of one or more side factors (JIT, garbage collection, which JVM I'm using for optimization, etc.) The accepted answer demonstrates the ~1 ms difference between a test with and without `par`. – Kevin Meredith Aug 29 '13 at 16:35

3 Answers


When I load this into my REPL and do ParallelTest.main(Array()) twice:

scala> ParallelTest.main(Array())
with par: elapsed: 23 milliseconds
without par: elapsed: 1 milliseconds

scala> ParallelTest.main(Array())
with par: elapsed: 1 milliseconds
without par: elapsed: 0 milliseconds

Almost all of what you are seeing is JIT warmup. Hotspot optimizes the relevant methods after the first loop, and we see the benefits in the next three iterations. Proper benchmarking on the JVM requires throwing away the first few results.
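
A rough way to account for warmup without a full benchmarking framework is to run the body several times and report only the later iterations. Here is a minimal sketch; the timeIt helper and the warm-up/run counts are illustrative, not part of the original program:

object WarmedUpBenchmark {
  // Runs `body` a few times to let the JIT warm up, then reports the later runs only.
  def timeIt(label: String, warmups: Int = 5, runs: Int = 5)(body: => Unit): Unit = {
    for (_ <- 1 to warmups) body                        // discarded warm-up iterations
    val millis = for (_ <- 1 to runs) yield {
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1000000.0
    }
    println(s"$label: ${millis.mkString(", ")} ms")
  }

  def main(args: Array[String]): Unit = {
    timeIt("with par")    { (1 to 10000).toList.par }
    timeIt("without par") { (1 to 10000).toList }
  }
}

For anything serious, a dedicated tool such as ScalaMeter or JMH handles warm-up, GC and statistical noise far better than a hand-rolled loop like this.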

wingedsubmariner
  • Since `par` will copy all data from one object to a thread-safe object, isn't that overhead expected to take more time? – Kevin Meredith Aug 29 '13 at 02:34
  • @Kevin List is already thread-safe, but unfortunately isn't well adapted to parallel algorithms, so in this case it is converted to a ParVector (see the quick check after these comments). Yes, this will imply some overhead, but it is much smaller than your initial tests showed. – wingedsubmariner Aug 29 '13 at 02:42
  • @Kevin parallel and concurrent (thread-safe) are different. Also, look at these results in nanos, and observe that `toList` and `par` take equal time in copying. And `(1 to N).to[ParVector]` takes same. But `(1 to N).par.to[ParVector]` is variable, and struggles not to be slower; on my machine, sometimes it gets in the ballpark of `toList`. But on large N it actually wins. – som-snytt Aug 29 '13 at 10:51
  • @som-snytt, when you say "But on large N it actually wins," you're saying that `par` wins? – Kevin Meredith Aug 29 '13 at 16:28
  • @Kevin Yes, see conclusion to my answer. This answer says par overhead is trivial, but list.par is not negligible; the other answer says you need to have real work to parallelize to make it worth it; but in fact, the mere work of building a largish (1M ints) trivially filled ParVector in parallel already beats going through sequential List. I would have bet on sequential range.toList for bigger N, since range and List receive so much attention. Anyway, measure and learn. – som-snytt Aug 29 '13 at 22:57
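
A quick way to see the conversion mentioned in the comments above is to inspect the runtime class. This is an illustrative check I added, not from the thread, and it assumes the 2.10/2.11-era standard-library parallel collections:

// .par on a List cannot reuse the list's structure, so it copies into a ParVector.
val p = (1 to 10000).toList.par
println(p.getClass.getSimpleName)   // expected to print: ParVector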

I am as idly curious about meaningless microbenchmarks as the next hacker, so here is a demonstration of why the result is meaningful, why it matters where you put the par, and why the OP's conjecture was correct (even if the methodology was flawed):

scala> import System.nanoTime
import System.nanoTime

scala> def timed(op: =>Unit) = { val t0=nanoTime;op;println(nanoTime-t0) }
timed: (op: => Unit)Unit

scala> val data = (1 to 1000000).toList
data: List[Int] = List(1, 2, 3, 4,...

scala> timed(data.par)
85333715

scala> timed(data.par)
40952638

scala> timed(data.par)
40134628

On my machine, constructing a small 10k list takes the same time as calling par on it, around 400k nanos, which is why, in the green checked answer, .toList.par rounds up to one and .toList rounds down to zero.

OTOH, constructing a large 1m list sequentially is more variable.

scala> 1 to 100 foreach (_ => timed((1 to 1000000).toList))

loses a factor of ten somewhere. I haven't looked to see whether that is due to reallocations, garbage collection, memory architecture or what.

But it's interesting how easily this works:

scala> import scala.collection.parallel.immutable.ParVector
import scala.collection.parallel.immutable.ParVector

scala> 1 to 100 foreach (_ => timed((1 to 1000000).par.to[ParVector]))

The ParRange edges out the sequential Range in this test and is faster than data.par. (On my machine.)

What's interesting to me is that there is no computation to parallelize here.

This must mean that it's inexpensive to assemble a ParVector in parallel. Compare this other answer where the costs of assembly in a parallel groupBy were surprising to me as a ParNewbie.
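
For readers who want to reproduce this outside the REPL, here is a self-contained sketch of the same comparison. It is my own packaging of the snippets above; the sizes and run counts are arbitrary, and it assumes a Scala version where .par and .to[ParVector] ship with the standard library:

import scala.collection.parallel.immutable.ParVector

object ParAssembly {
  def timed(label: String)(op: => Unit): Unit = {
    val t0 = System.nanoTime()
    op
    println(s"$label: ${(System.nanoTime() - t0) / 1000000.0} ms")
  }

  def main(args: Array[String]): Unit = {
    // Compare building a sequential List with assembling a ParVector in parallel.
    // Repeat so the later iterations are past JIT warm-up.
    for (_ <- 1 to 5) {
      timed("sequential (1 to 1M).toList")            { (1 to 1000000).toList }
      timed("parallel   (1 to 1M).par.to[ParVector]") { (1 to 1000000).par.to[ParVector] }
    }
  }
}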

som-snytt

Others have remarked about the difficulty of doing microbenchmarks on the JVM because of non-deterministic warm-up uncertainties. I'd like to raise a different topic.

The parallel collections framework needs to be used with care. All attempts to improve the speed of software via parallelisation are subject to Amdahl's Law: the speedup of a program using parallel processors is limited by the time needed for the sequential fraction of the program.
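
In formula form, if a fraction p of the work can be parallelised and n processors are available, the speedup is bounded by 1 / ((1 - p) + p / n). A tiny illustrative calculation (the example numbers are my own, not from the answer):

// Amdahl's Law: upper bound on speedup with parallelisable fraction p and n processors.
def amdahlSpeedup(p: Double, n: Int): Double = 1.0 / ((1.0 - p) + p / n)

println(amdahlSpeedup(0.5, 8))    // ~1.78: 8 cores barely help a 50% parallel program
println(amdahlSpeedup(0.95, 8))   // ~5.9:  the payoff requires a mostly parallel workload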

So it's important that the parallel collections are applied only when a real application that might use them can be reliably (and consistently!) benchmarked to determine which parts are worth attempting in parallel and which are not. Fortunately, it's relatively easy to switch between parallel and sequential collections to compare their use.
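
The switch is usually just a matter of inserting .par (and .seq to go back), which makes such comparisons cheap to run. A hedged sketch of the kind of A/B test meant here, with a workload and sizes invented purely for illustration:

// Some non-trivial per-element work, so there is actually something to parallelise.
def expensive(i: Int): Double =
  (1 to 2000).foldLeft(i.toDouble)((acc, j) => acc + math.sqrt(j))

val xs = (1 to 100000).toVector

val sequentialSum = xs.map(expensive).sum       // sequential pipeline
val parallelSum   = xs.par.map(expensive).sum   // same pipeline with parallel collections

Whether the .par version wins depends on the data size and the per-element cost, which is exactly why measuring both in the real application matters.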

Also, using parallelism to improve speed is related to, but distinct from, using concurrency to express a solution. Actors provide the latter in Scala; Go, Occam and other languages instead rely on CSP (communicating sequential processes) to provide a finer-grained, mathematically grounded expression of concurrency (and there is ongoing work to support CSP in Scala too). Typically, concurrent programs are more amenable to parallel execution than sequential programs sprinkled with parallel collections, largely because of Amdahl's Law. Parallel collections prove useful only with relatively large data sets and a relatively heavy processing load per element.

Rick-777
  • Uh, the exact opposite is true. Amdahl's Law will damn CSP and Actors as well; in fact it is worse in their case. For a series of CSP processes or actors, your system ends up being limited by the slowest part of the pipeline, a particularly damning instance of Amdahl's Law. Embarrassingly parallel operations, like those the parallel collections library is meant to handle, will scale very nicely up to any number of cores. – wingedsubmariner Aug 29 '13 at 23:00
  • Also, Amdahl's law never implies that there won't be benefits from parallelization, only diminishing returns as the parallelized part of the program takes up less and less of the total runtime. – wingedsubmariner Aug 29 '13 at 23:02
  • As you say, *embarrassingly* parallel operations will speed up; that's the nature of it being *embarrassingly* easy to parallelise. You are correct that CSP, actors and parallel collections will *all* need to deal with Amdahl's Law, but you are wrong that CSP would be worse; it depends on the case. Whilst people working with communicating process architectures are already au fait with handling this, I have experienced a naivety in some people expecting unspecified magic from parallel collections, which invites some remarks on the topic. – Rick-777 Aug 30 '13 at 09:19