17

I have a list of ~200 elements, but it contains only a few unique values (~20). I want to keep only the unique values. Between `list.stream().collect(Collectors.toSet())` and `list.stream().distinct().collect(Collectors.toList())`, which is more efficient with respect to latency and memory consumption?
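For reference, a minimal runnable version of the two pipelines being compared; the `String` element type and the sample values are only placeholders:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DedupComparison {
    public static void main(String[] args) {
        // many elements, only a handful of unique values
        List<String> list = List.of("a", "b", "a", "c", "b", "a");

        // Option 1: collect straight into a Set (result type: Set, order unspecified)
        Set<String> unique = list.stream().collect(Collectors.toSet());

        // Option 2: de-duplicate but keep a List (result type: List, encounter order preserved)
        List<String> uniqueList = list.stream().distinct().collect(Collectors.toList());

        System.out.println(unique);
        System.out.println(uniqueList); // [a, b, c]
    }
}
```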

Eugene
  • 117,005
  • 15
  • 201
  • 306
Laxmikant
  • 1,551
  • 1
  • 13
  • 30
  • https://stackoverflow.com/questions/41593538/is-it-worth-using-distinct-with-collecttoset – pvpkiran Feb 26 '18 at 17:57
  • not the same question :) – Laxmikant Feb 26 '18 at 17:59
  • 2
    In this specific case, `new HashSet<>(list)` resp. `new ArrayList<>(new HashSet<>(list))` might turn out to be more efficient than the stream operations and the only remaining question is what actual result type do you need… – Holger Feb 27 '18 at 10:31
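A minimal sketch of the plain-collection approach suggested in the last comment above; the element type and sample values are placeholders:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PlainCollections {
    public static void main(String[] args) {
        List<String> list = List.of("a", "b", "a", "c", "b");

        // If a Set is an acceptable result type, this is the simplest option.
        Set<String> unique = new HashSet<>(list);

        // If a List is required, de-duplicate through a Set first.
        List<String> uniqueList = new ArrayList<>(new HashSet<>(list));

        System.out.println(unique);
        System.out.println(uniqueList);
    }
}
```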

2 Answers

22

While the answer is pretty obvious - for such a small number of elements, don't bother with these details of speed and memory consumption, and keep in mind that one returns a `Set` while the other returns a `List` - there are some interesting small details (interesting IMO).

Suppose you are streaming from a source that is already known to be distinct; in such a case the `.distinct()` operation will be a no-op, because there is nothing for it to do.
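A small sketch of how a stream can already know its source is distinct: the spliterator of a `HashSet` reports the `DISTINCT` characteristic, which is what lets the pipeline treat `.distinct()` as a no-op (sample values are placeholders):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Spliterator;

public class DistinctSource {
    public static void main(String[] args) {
        // A HashSet's spliterator reports DISTINCT, so a stream built on top
        // of it carries that knowledge into the pipeline.
        Set<String> set = new HashSet<>(List.of("a", "b", "c"));
        System.out.println(set.spliterator().hasCharacteristics(Spliterator.DISTINCT)); // true

        // A List makes no such promise, so .distinct() has real work to do.
        List<String> list = List.of("a", "a", "b");
        System.out.println(list.spliterator().hasCharacteristics(Spliterator.DISTINCT)); // false
    }
}
```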

If you are streaming from a `List` (which is ordered by design) and there are no intermediate operations (`unordered()`, for example) that change that, `.distinct()` will be forced to preserve the encounter order by using a `LinkedHashSet` internally - which is pretty expensive.
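A sketch of what that ordering guarantee means in practice, and of how `unordered()` can be used to opt out of it when the order is irrelevant (sample values are placeholders):

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrderedDistinct {
    public static void main(String[] args) {
        List<String> list = List.of("b", "a", "b", "c", "a");

        // Ordered source: distinct() keeps the first occurrence of each element
        // in encounter order -> [b, a, c]
        System.out.println(list.stream().distinct().collect(Collectors.toList()));

        // If the order does not matter, saying so explicitly lets the pipeline
        // (especially a parallel one) de-duplicate more cheaply; the order of
        // the result is then unspecified.
        System.out.println(list.parallelStream().unordered().distinct().collect(Collectors.toList()));
    }
}
```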

If you are doing parallel processing, the `list.stream().collect(Collectors.toSet())` version will merge multiple `HashSet`s (in Java 9 this merging has been slightly improved compared to Java 8). `.distinct()`, on the other hand, will spin up a `ConcurrentHashMap` that keeps all the keys with a dummy `Boolean.TRUE` value (it also does something interesting to preserve a `null` that your stream might contain; even that is handled differently internally in the two cases).
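A sketch of the two parallel variants the paragraph describes; the comments summarize the behaviour described above rather than the exact internals, and the sample values are placeholders:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ParallelDedup {
    public static void main(String[] args) {
        List<Integer> list = List.of(1, 2, 2, 3, 3, 3);

        // Collecting into a Set in parallel: each worker thread fills its own
        // container and the partial results are merged at the end.
        Set<Integer> asSet = list.parallelStream().collect(Collectors.toSet());

        // distinct() on an ordered parallel stream must still respect the
        // encounter order, which costs extra coordination between threads.
        List<Integer> asList = list.parallelStream().distinct().collect(Collectors.toList());

        System.out.println(asSet);  // order unspecified
        System.out.println(asList); // [1, 2, 3]
    }
}
```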

Ousmane D.
  • 54,915
  • 8
  • 91
  • 126
Eugene
  • 117,005
  • 15
  • 201
  • 306
  • 1
    Very interesting point. Especially the first part on sequential streams with `distinct()` and ordered collections that we should consider for very large collections. – davidxxx Feb 26 '18 at 21:44
  • Hello. About the `LinkedHashSet` used internally, I tested and I noticed that it is only for the parallel stream case( `DistinctOps.makeRef(...).reduce()` used by parallel evaluation methods). If I am wrong, tell me. – davidxxx Feb 27 '18 at 09:22
  • 1
    @davidxxx sounds plausible, for a sequential stream, there is no need for a `LinkedHashSet` to maintain the encounter order, a `HashSet` will do. But there are so many optimizations… E.g., if the source is sorted to the natural order, another code path will be taken, not using a `Set` at all. – Holger Feb 27 '18 at 10:29
  • @Holger Thanks for that. But why would the encounter order matter for a parallel stream ? – davidxxx Feb 27 '18 at 18:50
  • 2
    @davidxxx the stream always maintains the encounter order, if there is one. But the needed effort differs. For a sequential stream, you can do something like `/*loop logic */ { if(set.add(element)) consumer.accept(element); }` and the consumer will receive the elements in the encounter order; the order of the set doesn’t matter as it’s never iterated. For a parallel stream, threads may have to defer elements to see whether other threads have already seen them. Then, they have to iterate over the set’s (remaining) elements, which requires the set to maintain the insertion order. – Holger Feb 28 '18 at 09:49
  • @Holger It makes totally sense. Thanks a lot :) With so much synchronization (although required), it is not so surprising that it is hard to find use cases where parallel streams are more efficient. – davidxxx Feb 28 '18 at 11:12
  • @davidxxx well, when the order doesn’t matter, inserting an `unordered()` before `distinct()` can make a huge difference… – Holger Feb 28 '18 at 12:20

5

A `Set` (typically a `HashSet`) consumes more memory than a `List` (typically an `ArrayList`), mainly because of the hash table it has to store. But with so few elements you will not notice a difference in memory consumption.
What you should care about instead is that these collectors return different things: a `List` and a `Set`, each with its own characteristics, particularly in how you access their elements.
So use the approach whose result matches what you want to do with the collection.
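A small sketch of that point about result types, using placeholder values: pick the collector whose result matches the access pattern you need.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ResultTypes {
    public static void main(String[] args) {
        List<String> list = List.of("a", "b", "a", "c");

        // A Set is the natural fit for fast membership tests.
        Set<String> asSet = list.stream().collect(Collectors.toSet());
        System.out.println(asSet.contains("c")); // true, roughly constant time for a HashSet

        // A List is the natural fit for positional access and a stable iteration order.
        List<String> asList = list.stream().distinct().collect(Collectors.toList());
        System.out.println(asList.get(0)); // "a" - first element in encounter order
    }
}
```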

davidxxx
  • 125,838
  • 23
  • 214
  • 215