
I am just starting to learn about Streams and parallelism in Java, and I was wondering why a normal for loop takes less time than a parallel IntStream at filling an array.

package parallel;

import java.util.stream.IntStream;

public class Parallel {

    public static void main(String[] args) {
        final int[] intArray = new int[100000];
        long startTime = System.currentTimeMillis();
        IntStream.range(0, 100000).parallel().forEach(i -> intArray[i] = i);
        long endTime = System.currentTimeMillis();
        System.out.println("Parallel time: " + (endTime-startTime));
        final int[] intArray2 = new int[100000];
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        startTime = System.currentTimeMillis();
        for(int i = 0; i < 100000; i++){
            intArray2[i] = i;
        }
        endTime = System.currentTimeMillis();
        System.out.println("Non parallel time: " + (endTime-startTime));
    }
}

I'm getting results like this:

Parallel time: 110
Non parallel time: 7

Zombies
  • First, benchmarks are notoriously difficult to perform accurately, and microbenchmarks are worse than useless. I will point out that your array has already been constructed and you loop and set a value, while the parallel forEach doesn't start with a preinitialised value. – Elliott Frisch Nov 10 '14 at 06:50
  • There are two different arrays being filled (`intArray` and `intArray2`), so both cases start from a "clean" array. – cello Nov 10 '14 at 07:00
  • The problem is your measurement methodology, full stop. The biggest problem with this measurement approach is that it _prints a number_, which humans are tempted to interpret as having meaning. It does not. – Brian Goetz Dec 19 '19 at 17:06

1 Answer


The operation you perform for each element is very simple: it's just an assignment, which is very fast. In the parallel version, you pay a lot of overhead for starting and coordinating the multiple threads that handle the operations. That alone will likely take longer than the entire non-parallel loop does for such a simple operation.
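One rough way to see how much of that first number is one-time overhead (fork/join pool startup, JIT compilation) is to repeat each measurement and compare later runs with the first. This is only a sketch, not a proper benchmark (a harness like JMH is the right tool), and the run count and array size here are arbitrary choices:

```java
import java.util.stream.IntStream;

public class RepeatTiming {
    public static void main(String[] args) {
        final int n = 100_000;
        // Repeat the comparison several times; the first runs include
        // one-time costs (thread pool startup, JIT warmup) that later
        // runs do not pay.
        for (int run = 0; run < 5; run++) {
            int[] arr = new int[n];
            long t0 = System.nanoTime();
            IntStream.range(0, n).parallel().forEach(i -> arr[i] = i);
            long t1 = System.nanoTime();

            int[] arr2 = new int[n];
            long t2 = System.nanoTime();
            for (int i = 0; i < n; i++) arr2[i] = i;
            long t3 = System.nanoTime();

            System.out.println("run " + run
                    + ": parallel " + (t1 - t0) / 1_000 + "us"
                    + ", sequential " + (t3 - t2) / 1_000 + "us");
        }
    }
}
```

Even with this, the timings of a sub-millisecond loop remain noisy; the point is only that the first parallel measurement carries startup costs the sequential loop never pays.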

Also, in the non-parallel version the values are written to the array strictly sequentially, an access pattern that CPUs have had single-thread/single-core optimizations for (prefetching, caching) for quite a while, as do compilers and intermediate compilers (which convert code like C to assembly). In the parallel version, though, you might get contention: each thread writes to a different position, but neighbouring positions can still sit on the same cache line, so the threads may fight over those lines; and since several threads access different parts of the array, you might also get cache misses, which slow things down.

With a more expensive per-element operation, the parallel overhead becomes small compared to the total cost, and in the end the parallel version executes faster than the non-parallel one.
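As a hedged illustration of that last point (not a rigorous benchmark; `slowCalc` and its iteration count are made-up stand-ins for "expensive work"), replacing the plain assignment with a costlier computation tends to tip the balance toward the parallel version:

```java
import java.util.stream.IntStream;

public class ExpensiveOp {
    // Deliberately costly, deterministic per-element work.
    // The inner loop count of 100 is an arbitrary choice.
    static double slowCalc(int i) {
        double x = i;
        for (int k = 0; k < 100; k++) {
            x = Math.sin(x) + Math.cos(x);
        }
        return x;
    }

    public static void main(String[] args) {
        final double[] a = new double[100_000];
        long t0 = System.currentTimeMillis();
        IntStream.range(0, a.length).parallel().forEach(i -> a[i] = slowCalc(i));
        long t1 = System.currentTimeMillis();
        System.out.println("Parallel time: " + (t1 - t0));

        final double[] b = new double[100_000];
        t0 = System.currentTimeMillis();
        for (int i = 0; i < b.length; i++) {
            b[i] = slowCalc(i);
        }
        t1 = System.currentTimeMillis();
        System.out.println("Non parallel time: " + (t1 - t0));
    }
}
```

The same caveats about single-shot timings apply here; the difference is simply that the per-element cost now dwarfs the thread-coordination overhead.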

cello
  • I doubt it would cause so much difference though - a flawed benchmark is probably part of the explanation too. – assylias Nov 10 '14 at 07:07
  • Sure, it should be executed a lot of times after a decent JVM warmup and then averaged, as otherwise there will be too much random noise in the results. But even with a correct setup, my guess is that with the above code the parallel one would still be slower (just based on personal experience; a correct benchmark would prove me right or wrong ;) ). – cello Nov 10 '14 at 07:11
  • You suggest that `parallel().forEach` would distribute the work across threads interleaving i values, e.g. each of 4 threads doing `i+=4` from different starting points. That would make the interface unusably bad for doing anything with an array; surely a sane implementation would break the range up into contiguous sub-ranges of `i` values so each thread only touches some of the cache lines. Thread startup overhead and flawed benchmarking (no warmup for the JVM, CPU frequency, or array page faults) is still enough to explain a big perf difference for a simple memset loop over only 400kB. – Peter Cordes Dec 19 '19 at 15:50