27

I want to take an input, process it with a parallel stream, and get the output as a list. The input could be any List or any other collection that supports streams.

My concern here is that if we want the output as a map, then Java gives us an option like

list.parallelStream().collect(Collectors.toConcurrentMap(args))

But I can't see any option to collect from a parallel stream in a thread-safe way that gives a list as output. One more option I see is to use

list.parallelStream().collect(Collectors.toCollection(<Concurrent Implementation>))

This way we can pass various concurrent implementations to the collect method. But I think CopyOnWriteArrayList is the only List implementation in java.util.concurrent. We could use the various queue implementations here, but those would not behave like a list. What I mean is that these are only workarounds for getting a list.

Could you please advise on the best way to get the output as a list?

Note: I could not find any other post related to this, any reference would be helpful.

Stefan Zobel
Vip

2 Answers

46

The Collection object used to receive the data being collected does not need to be concurrent. You can give it a simple ArrayList.

That is because the values from a parallel stream are not actually collected into a single Collection object. Each thread collects its own data, and then all sub-results are merged into a single final Collection object.

This is all well-documented in the Collector javadoc, and the Collector is the parameter you're giving to the collect() method:

<R,A> R collect(Collector<? super T,A,R> collector)
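As a minimal sketch of the point above (the class and method names are made up for illustration), a plain, non-concurrent ArrayList can be handed to toCollection on a parallel stream. Because each thread fills its own intermediate list and the lists are merged afterwards, no element is lost and the encounter order of the source list is kept:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelToList {

    // Hypothetical helper: squares every element in parallel and collects
    // the results into an ordinary (non-thread-safe) ArrayList.
    static List<Integer> squares(List<Integer> input) {
        return input.parallelStream()
                .map(x -> x * x)
                .collect(Collectors.toCollection(ArrayList::new));
    }

    public static void main(String[] args) {
        List<Integer> input = IntStream.rangeClosed(1, 1000)
                .boxed()
                .collect(Collectors.toList());

        List<Integer> result = squares(input);

        System.out.println(result.size());                         // 1000
        System.out.println(result.get(0) + " " + result.get(999)); // 1 1000000
    }
}
```

If a specific List implementation is not needed, Collectors.toList() can be used in the same way.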
Andreas
  • I think I missed that part. My initial understanding was that the collection we pass in would be the single place where everything is collected. But my question now is: why do we need `Collectors.toConcurrentMap` then? They could have used a simple hash map, combined the partial maps, and returned the result. – Vip May 20 '17 at 08:32
  • @VipulGoyal this is obviously for optimization purposes. Merging big `HashMap`s can be quite expensive and `ConcurrentHashMap` was already there when they implemented streams, so why not just use it? – Eugene May 20 '17 at 08:34
  • @Eugene I agree with you that it is expensive to merge `HashMap`s. But what I am wondering now is why we don't have any better implementation of a concurrent list other than `CopyOnWriteArrayList`, which is quite expensive. What is the challenge there, or am I missing something? Anyway, I got my answer; that is a different discussion altogether. – Vip May 20 '17 at 08:43
  • @VipulGoyal If the stream (input) and the collection (output) are both ordered, a concurrent collection won't help, because the values must be collected in order. However, if order does not have to be maintained, and the collection is concurrent, then all the parallel threads can add to a single result collection, instead of building intermediate subresults that then need to be merged. – Andreas May 20 '17 at 08:43
  • @Vipul Goyal: merging two `HashMap`s implies rehashing all entries of one of the maps. In contrast, merging two `ArrayList`s implies a single plain memory transfer. Further, keep in mind that `Collectors.toList()` does not specify to return an `ArrayList`, not even a mutable list. So a future version could return a different `List` implementation, easier to merge when building, but unmodifiable afterwards… – Holger May 22 '17 at 13:26
15

"But there is no option that I can see to collect from parallel stream in thread safe way to provide list as output."

This is entirely wrong.

The whole point of streams is that you can use a non-thread-safe Collection and still get perfectly valid, thread-safe results. This follows from how streams are implemented (and it was a key part of their design). A Collector defines a supplier method that creates a new container instance for each part of the stream that is processed in parallel. Those partial containers are then merged with one another.

So this is perfectly thread safe:

 Stream.of(1,2,3,4).parallel()
          .collect(Collectors.toList());

Since there are 4 elements in this stream, up to 4 instances of ArrayList can be created (one per part the stream is split into, assuming at least 4 CPU cores), and they are merged at the end into a single result.
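To make the supplier and merge steps visible, here is a rough sketch of a hand-written list collector built with Collector.of. The class name, the countingToList helper, and the counter are illustrative assumptions, not part of the JDK. The supplier increments a counter each time a new ArrayList is requested, so you can observe how many intermediate containers were created on your machine; the exact number depends on how the stream is split, so it is not fixed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class CountingCollectorDemo {

    // Hypothetical collector: behaves like a simple "to list" collector, but
    // counts how many intermediate containers the supplier hands out.
    static <T> Collector<T, List<T>, List<T>> countingToList(AtomicInteger created) {
        return Collector.of(
                () -> {                       // supplier: one new list per request
                    created.incrementAndGet();
                    return new ArrayList<T>();
                },
                List::add,                    // accumulator: only touches its own list
                (left, right) -> {            // combiner: merges two partial lists
                    left.addAll(right);
                    return left;
                });
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();

        List<Integer> result = Stream.of(1, 2, 3, 4)
                .parallel()
                .collect(countingToList(created));

        System.out.println(result); // [1, 2, 3, 4] (encounter order is kept)
        System.out.println("containers created: " + created.get());
    }
}
```

The accumulator and the combiner both use plain, unsynchronized ArrayList operations; that is safe only because the stream framework never lets two threads work on the same intermediate list at the same time.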

On the other hand, collectors like toConcurrentMap create a single result container, and all threads put their results into it.
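One way to see the difference is to inspect the characteristics each collector reports. The sketch below uses made-up class and method names, and the characteristics of Collectors.toList() are an implementation detail observed in current OpenJDK builds rather than something its javadoc promises. It checks for the CONCURRENT flag and then collects into a single shared ConcurrentMap:

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ConcurrentCollectorDemo {

    // Hypothetical helper: true if the collector declares itself CONCURRENT,
    // i.e. all threads may accumulate into one shared result container.
    static boolean isConcurrent(Collector<?, ?, ?> collector) {
        return collector.characteristics()
                .contains(Collector.Characteristics.CONCURRENT);
    }

    // Collects value -> square into one shared ConcurrentMap.
    static ConcurrentMap<Integer, Integer> squaresByValue(List<Integer> input) {
        return input.parallelStream()
                .collect(Collectors.toConcurrentMap(x -> x, x -> x * x));
    }

    public static void main(String[] args) {
        Collector<Integer, ?, List<Integer>> toList = Collectors.toList();
        Collector<Integer, ?, ConcurrentMap<Integer, Integer>> toMap =
                Collectors.toConcurrentMap(x -> x, x -> x * x);

        System.out.println(isConcurrent(toList)); // false
        System.out.println(isConcurrent(toMap));  // true

        List<Integer> input = IntStream.rangeClosed(1, 100)
                .boxed()
                .collect(Collectors.toList());
        System.out.println(squaresByValue(input).get(7)); // 49
    }
}
```

A concurrent collector gives up encounter order in exchange for skipping the merge step, which fits a map keyed by value but not a list whose order must match the input.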

Eugene
  • …assuming at least four CPU cores. – Holger May 22 '17 at 13:29
  • @Holger I am trying to be attentive to details, but you are way above that... :) thx so much for the comment! – Eugene May 22 '17 at 13:43
  • I think you're right overall, but your reasoning was confusing to me. The [Collectors.toList](http://hg.openjdk.java.net/jdk10/jdk10/jdk/file/777356696811/src/java.base/share/classes/java/util/stream/Collectors.java#l275) implementation does (as you say) create a new `ArrayList` for each part of the stream being processed in parallel, but the merge uses a thread-unsafe `addAll` call to merge the second list into the first before returning the first list (rather than creating a new one), though this is still safe because it merges pairs so there's never 2 concurrent `addAll` calls on one list. – Cameron Stone Jun 28 '18 at 17:50
  • @CameronStone right. This is also called fold left. You can try `reduce` without creating a new list all the time for the merge function and see it breaking btw, in parallel – Eugene Jun 28 '18 at 17:56
  • What about the part in the `Stream#collect` javadoc saying "If the stream is parallel, and the `Collector` is concurrent, and ..., then a concurrent reduction will be performed (see `Collector` for details on concurrent reduction.)" `Collectors.toList()` creates a `Collector` implementation that is not concurrent. What does it mean then? – Jan Krakora Feb 15 '19 at 12:13
  • @Behnil what does "this mean"? which "this"? the answer or your question in the comments? can you clarify please – Eugene Feb 15 '19 at 15:53
  • @Eugene I mean when they say "concurrent reduction will be performed only when parallel collector is used", and `Collectors.toList()` is not parallel, does use of `Collectors.toList()` together with parallel stream make any sense? – Jan Krakora Feb 15 '19 at 19:42
  • @Behnil `If the stream is parallel` ... `Collector is concurrent` ... `Collectors.toList()` is not a concurrent collector. where is the unclarity here? – Eugene Feb 17 '19 at 01:26
  • ...assuming at least 5 CPU cores! Stream by default uses the ForkJoinPool and ForkJoinPool.commonPool() size by default is Runtime.getRuntime().availableProcessors() - 1 – herburos Dec 07 '19 at 17:22