
Method 1

The usual approach: very fast, and it works great.

public static int loops = 500;
private static ExecutorService customPool = Executors.newFixedThreadPool(loops);
.
.
Instant start = Instant.now();
LongSummaryStatistics stats = LongStream.range(0, loops).boxed()
        .map(number -> CompletableFuture.supplyAsync(() -> DummyProcess.slowNetworkCall(number), customPool))
        .collect(Collectors.toList()).stream() // collect first, else will be sequential
        .map(CompletableFuture::join)
        .mapToLong(Long::longValue)
        .summaryStatistics();

log.info("cf completed in :: {}, summaryStats :: {} ", Duration.between(start, Instant.now()).toMillis(), stats);
// ... cf completed in :: 1054, summaryStats :: LongSummaryStatistics{count=500, sum=504008, min=1000, average=1008.016000, max=1017} 

I understand that if I don't collect the stream first, then due to laziness the stream will spin up the CompletableFutures one by one and behave synchronously. So, as an experiment, I tried Method 2 below.
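But first, for reference, here is roughly what that un-collected, non-parallel variant would look like; a sketch only, not one of the measured runs, reusing the same loops, customPool and DummyProcess as above. Each element is pulled through the whole pipeline before the next one starts, so every join() blocks before the next supplyAsync() is even created, which should add up to roughly loops seconds with 1-second calls:

    // sequential by laziness: every future is joined before the next one is created
    LongSummaryStatistics seqStats = LongStream.range(0, loops).boxed()
            .map(number -> CompletableFuture.supplyAsync(() -> DummyProcess.slowNetworkCall(number), customPool))
            .map(CompletableFuture::join) // blocks here, one element at a time
            .mapToLong(Long::longValue)
            .summaryStatistics();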

Method 2

Remove the intermediate collect step, but also make the stream parallel:

Instant start = Instant.now();
LongSummaryStatistics stats = LongStream.range(0, loops).boxed()
        .parallel()
        .map(number -> CompletableFuture.supplyAsync(() -> DummyProcess.slowNetworkCall(number), customPool))
        .map(CompletableFuture::join) // direct join
        .mapToLong(Long::longValue).summaryStatistics();

log.info("cfps_directJoin completed in :: {}, summaryStats :: {} ", Duration.between(start, Instant.now()).toMillis(), stats);
// ... cfps_directJoin completed in :: 8098, summaryStats :: LongSummaryStatistics{count=500, sum=505002, min=1000, average=1010.004000, max=1015}

Summary:

  • Method 1 :: 1 second
  • Method 2 :: 8 seconds

A pattern I observed:

  1. The parallel stream approach "batches" about 60 calls at once, so with 500 loops, 500/60 ~ 8 batches, each taking ~1 second, for a total of ~8 seconds.
  2. So, when I reduce the loop count to 300, there are 300/60 = 5 batches, and it does indeed take about 5 seconds to complete.

So, the question is:

Why is there this batching of calls in the parallel + direct join approach?


For completeness, here's my dummy network call method:

    public static Long slowNetworkCall(Long i) {
        Instant start = Instant.now();
        log.info(" {} going to sleep..", i);
        try {
            TimeUnit.MILLISECONDS.sleep(1000); // 1 second
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        log.info(" {} woke up..", i);
        return Duration.between(start, Instant.now()).toMillis();
    }
Somjit
  • Good question. I wonder if this is an artifact of an internal implementation detail of `ForkJoinPool` when threads (above configured parallelism) are blocked. Essentially your `.map(CompletableFuture::join)` will block a `ForkJoinPool` thread, and at that time a new one will be spawned to support the configured parallelism. This might (I am not sure) somehow cause the behavior you experience. – Eugene May 06 '21 at 04:33
  • You could try to use your own thread pool. – dan1st May 06 '21 at 04:41
  • yeah, I am more convinced this is "just" the way `ForkJoinPool` behaves when you block its threads. Just run your code with `-Djava.util.concurrent.ForkJoinPool.common.parallelism=50`, for example; on my machine it finishes in 2 seconds (so it's not "batching" to `8` always). I wish I could tell you where that computation happens _exactly_ in the `ForkJoinPool`, but that code is insanely complicated. – Eugene May 06 '21 at 04:52
  • OK, I _think_ I have a hint. If you log `ForkJoinPool.commonPool().getPoolSize()` at each operation, you will see that it is (after you start blocking) constantly staying at `64`, which, as far as I understand, is the maximum number of threads that can work at a time. So that looks on par with the results that you see. I am not exactly sure this can count as an answer, though. – Eugene May 06 '21 at 05:07
  • @Eugene put a followup question, after trying some variations: https://stackoverflow.com/questions/67416858/forkjoinpool-size-increasing-dynamically – Somjit May 06 '21 at 11:06
  • The results will likely change depending on what is in that `slowNetworkCall`. You might try putting in actual IO, or a busy loop. Also, invoking logging might create sync points, which would affect the result. Finally, a second is a very long time in current systems; a smaller sleep time might be more realistic. – Thomas Bitonti May 06 '21 at 14:43

1 Answer


This is an artifact of how ForkJoinPool handles things when you block its inner threads, and of how many new ones it spawns (there is a small illustration of that right after the list below). Though I could probably find the exact lines where this happens, I am not sure it is worth it, for two reasons:

  • that logic can change

  • the code inside ForkJoinPool is far from trivial
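To make the "spawns new ones when you block" part a bit more concrete, here is a tiny standalone sketch (my own illustration, not part of the original answer; the class name is made up). It blocks common-pool workers through ForkJoinPool.managedBlock, which, as far as I understand, is the same compensation mechanism a blocking join ends up going through, and then prints the pool size:

    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.IntStream;

    public class BlockingGrowsPoolDemo {
        public static void main(String[] args) {
            IntStream.range(0, 64).parallel().forEach(i -> {
                try {
                    // tell the pool this worker is about to block; the pool may then
                    // create a compensation thread to keep up its parallelism
                    ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                        @Override
                        public boolean block() throws InterruptedException {
                            Thread.sleep(1000);
                            return true; // no further blocking needed
                        }
                        @Override
                        public boolean isReleasable() {
                            return false;
                        }
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // typically climbs well past commonPool().getParallelism()
                System.out.println("poolSize :: " + ForkJoinPool.commonPool().getPoolSize());
            });
        }
    }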

It seems that for both of us, ForkJoinPool.commonPool().getParallelism() returns 11, so I get the same results as you do. If you log ForkJoinPool.commonPool().getPoolSize() to find out how many active threads your code is using, you will see that after a certain period it just stabilizes at 64. So the maximum number of tasks that can be processed at the same time is 64, which is on par with the result that you see (those 8 seconds).
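For reference, one way to watch this happen is to log the pool size inside the mapping step of Method 2; a sketch only, reusing the loops, customPool, DummyProcess and log from the question:

    // same shape as Method 2, but logging the common pool's size at each join
    LongSummaryStatistics stats = LongStream.range(0, loops).boxed()
            .parallel()
            .map(number -> CompletableFuture.supplyAsync(() -> DummyProcess.slowNetworkCall(number), customPool))
            .map(cf -> {
                // once the joins start blocking, this climbs and then settles around 64
                log.info("commonPool size :: {}", ForkJoinPool.commonPool().getPoolSize());
                return cf.join();
            })
            .mapToLong(Long::longValue)
            .summaryStatistics();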

If I run your code with -Djava.util.concurrent.ForkJoinPool.common.parallelism=50, it now executes in 2 seconds, and the pool size increases to 256. That means there is internal logic that adjusts these kinds of things.
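If you prefer setting that from code rather than the command line, the same system property can, as far as I know, also be set programmatically, but only if it happens before anything touches the common pool for the first time; a minimal sketch (the class name is made up):

    import java.util.concurrent.ForkJoinPool;

    public class CommonPoolParallelismDemo {
        public static void main(String[] args) {
            // must run before the common pool is initialized, otherwise it has no effect
            System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "50");
            System.out.println(ForkJoinPool.commonPool().getParallelism()); // should print 50
        }
    }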

Eugene