19

Which is the most efficient way to obtain a distinct, sorted collection?

1. Enhanced loop with TreeSet

Set<MyObj> ret = new TreeSet<>();
for (Foo foo : foos)
    ret.add(new MyObj(foo));

2. Simple Stream

List<MyObj> ret = foos.stream().map(MyObj::new)
                      .distinct().sorted()
                      .collect(Collectors.toList());

3. TreeSet Stream

Set<MyObj> ret = foos.stream().map(MyObj::new)
                     .collect(Collectors.toCollection(TreeSet::new));

The first way seems to be the least elegant, yet it is easy to read. The second way makes me fear that distinct and sorted will traverse the stream twice. The last way feels okay, but what is the overhead of a TreeSet in a stream?

Any clues? Thank you.

Guillaume F.
    Why are your collections of type `Foo`, when they contain `MyObj`? – user3707125 Feb 15 '17 at 07:36
  • It was a typo, thank you for noticing. It's fixed. – Guillaume F. Feb 15 '17 at 07:38
  • `TreeSet` has no [constructor](https://docs.oracle.com/javase/8/docs/api/java/util/TreeSet.html#TreeSet--) using `int` and `float`. Are you sure you don't mean `HashSet`? – F. Lumnitz Feb 15 '17 at 07:48
  • In example 2, why do you call `.distinct`? You say "... to process a distinct ... ", so wouldn't everything be distinct already? – Todd Sewell Feb 15 '17 at 07:50
  • Collections have bulk operations like addAll(Collection). They are often more efficient than per-element operations. Why don't you try them out using a benchmarking tool such as http://openjdk.java.net/projects/code-tools/jmh/ – jackrabbit Feb 15 '17 at 07:55
  • @F.Lumnitz : Look at the comment, I simplified it. You have to create the Map and TreeMap yourself to achieve this. – Guillaume F. Feb 15 '17 at 08:02
  • @ToddSewell : The Foo list is not distinct. I want to do a distinct and a sort (which TreeSet does). – Guillaume F. Feb 15 '17 at 08:02
  • @jackrabbit : I have to transform Foo to MyObj, addAll won't help in this case. Will it? Before making a benchmark, I wanted to know if anybody knows the complexity of each method. – Guillaume F. Feb 15 '17 at 08:02
  • Is performance really that important here? Make sure you don't fall into the trap of [premature optimization](http://softwareengineering.stackexchange.com/questions/80084/is-premature-optimization-really-the-root-of-all-evil). – Todd Sewell Feb 15 '17 at 08:06
  • @ToddSewell It is, the algorithm has to deal with many asynchronous queries and a decent amount of data to process each time. I'm currently optimizing all the slow parts, and this is one of them. – Guillaume F. Feb 15 '17 at 08:11
  • @GuillaumeF. Can you provide the exact code for the first case without simplification? It is unclear what `new TreeSet(size, 1.0f)` means. – ZhekaKozlov Feb 15 '17 at 09:53
  • @ZhekaKozlov I just removed it from the question for clarity. It was a custom constructor I use to initialise the TreeSet's internal TreeMap with a size and load factor. – Guillaume F. Feb 15 '17 at 19:51

2 Answers

28

Initial analysis

Judging from the Stream API source code, my initial guess would be: for many items, the simple stream (2) should be the fastest, significantly outperforming the TreeSet version (1), with the TreeSet stream (3) following a little behind. For short data sets, (1) would probably beat (2), which would beat (3), because stream creation adds some overhead. The distinct-sorted stream works roughly like this:

Set<MyObj> set = new HashSet<>();
List<MyObj> result = new ArrayList<>();
for (Foo foo : foos) {
    MyObj myObj = new MyObj(foo);
    if(set.add(myObj))
        result.add(myObj);
}
result.sort(null);
return result;

Let's add this implementation as (4). It uses a HashSet to check whether results are distinct, adds them to an intermediate container, then sorts it. This should be faster than maintaining a TreeSet, as we don't need to keep the elements ordered after every insertion (which TreeSet does, possibly rebalancing the tree). The actual Stream implementation is somewhat less efficient, because it cannot sort the resulting list in place: it creates an intermediate container, sorts it, then dumps the result into the final list using a series of list.add calls.
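To make the comparison concrete, here is implementation (4) as a self-contained, runnable sketch. It is a rough model of what the distinct-sorted pipeline does, not the real JDK internals; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StreamShapeSketch {
    // Rough model of stream().distinct().sorted().collect(toList()):
    // deduplicate into an intermediate buffer, sort the buffer,
    // then copy it element by element into the final list.
    static <T extends Comparable<T>> List<T> distinctSorted(List<T> input) {
        Set<T> seen = new HashSet<>();
        List<T> buffer = new ArrayList<>();   // intermediate container
        for (T t : input)
            if (seen.add(t))                  // distinct() stage: add() is false for duplicates
                buffer.add(t);
        buffer.sort(null);                    // sorted() stage: natural ordering
        List<T> result = new ArrayList<>();
        for (T t : buffer)                    // collect() stage: series of add() calls
            result.add(t);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(distinctSorted(Arrays.asList(3, 1, 2, 3, 1)));  // [1, 2, 3]
    }
}
```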

The result could depend on the number of elements in the initial foos collection and also on the number of distinct elements. I call this diversity: diversity = 1 means that roughly every element is different; diversity = 0.5 means that every element is repeated roughly two times. The result may also depend heavily on the initial element order: sorting algorithms can be an order of magnitude faster when the input data is presorted or nearly presorted.
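The diversity of a collection, as defined here, boils down to distinct count divided by total count. A tiny helper (my own, for illustration) makes the definition precise:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class Diversity {
    // diversity = (number of distinct elements) / (total number of elements);
    // 1.0 means all elements differ, 0.5 means each value appears roughly twice.
    static double diversity(List<?> elements) {
        return (double) new HashSet<>(elements).size() / elements.size();
    }

    public static void main(String[] args) {
        System.out.println(diversity(Arrays.asList(1, 2, 3, 4)));  // 1.0
        System.out.println(diversity(Arrays.asList(1, 1, 2, 2)));  // 0.5
    }
}
```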

Experimental setup

So let's parameterize our tests in the following way:

  • size (number of elements in foos): 10, 1000, 100000
  • diversity (fraction of different ones): 1, 0.5, 0.2
  • presorted: true, false

I assume that Foo contains only one int field. Of course, the result might depend heavily on the compareTo, equals, and hashCode implementations of the Foo class, because versions (2) and (4) use equals and hashCode while versions (1) and (3) use compareTo. We will keep them simple:

@Override
public int hashCode() {
    return x;
}

@Override
public boolean equals(Object o) {
    return this == o || (o != null && getClass() == o.getClass() && x == ((Foo) o).x);
}

@Override
public int compareTo(Foo o) {
    return Integer.compare(x, o.x);
}

Presorted elements could be generated via:

foos = IntStream.range(0, size)
                .mapToObj(x -> new Foo((int)(x*diversity)))
                .collect(Collectors.toList());

Random elements could be generated via:

foos = new Random().ints(size, 0, (int) (size * diversity))
                   .mapToObj(Foo::new)
                   .collect(Collectors.toList());
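The two generation snippets above can be combined into a small runnable harness. The minimal Foo below follows the equals/hashCode/compareTo shown earlier (the class and method names of the harness itself are mine):

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FooGenerator {
    // Minimal Foo as described above: a single int field with
    // value-based equals/hashCode and natural ordering on that field.
    static final class Foo implements Comparable<Foo> {
        final int x;
        Foo(int x) { this.x = x; }
        @Override public int hashCode() { return x; }
        @Override public boolean equals(Object o) {
            return this == o || (o instanceof Foo && x == ((Foo) o).x);
        }
        @Override public int compareTo(Foo o) { return Integer.compare(x, o.x); }
    }

    // Presorted input: values grow monotonically; duplicates appear
    // when diversity < 1 because of the integer truncation.
    static List<Foo> presorted(int size, double diversity) {
        return IntStream.range(0, size)
                        .mapToObj(i -> new Foo((int) (i * diversity)))
                        .collect(Collectors.toList());
    }

    // Random input: uniformly random values drawn from a range whose
    // width controls the expected fraction of distinct elements.
    static List<Foo> random(int size, double diversity) {
        return new Random().ints(size, 0, (int) (size * diversity))
                           .mapToObj(Foo::new)
                           .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(presorted(10, 0.5).stream().distinct().count());  // 5
        System.out.println(random(10, 0.5).size());                          // 10
    }
}
```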

Measurements were performed with JMH 1.13 on JDK 1.8.0_101 (VM 25.101-b13, 64-bit).

Results

Presorted (all times are in μs):

diversity size      (1)      (2)      (3)      (4)
  1         10      0.2      0.5      0.3      0.2
  1       1000     48.0     36.0     53.0     24.2
  1     100000  14165.7   4759.0  15177.3   3341.6
0.5         10      0.2      0.3      0.2      0.2
0.5       1000     36.9     23.1     41.6     20.1
0.5     100000  11442.1   2819.2  12508.7   2661.3
0.2         10      0.1      0.3      0.2      0.2
0.2       1000     32.0     13.0     29.8     16.7
0.2     100000   8491.6   1969.5   8971.9   1951.7

Not presorted:

diversity size      (1)      (2)      (3)      (4)
  1         10      0.2      0.4      0.2      0.3
  1       1000     72.8     77.4     73.6     72.7
  1     100000  21599.9  16427.1  22807.8  16322.2
0.5         10      0.2      0.3      0.2      0.2
0.5       1000     64.8     46.9     69.4     45.5
0.5     100000  20335.2  11190.3  20658.6  10806.7
0.2         10      0.1      0.3      0.2      0.2
0.2       1000     48.0     19.6     56.7     22.2
0.2     100000  16713.0   5533.4  16885.0   5930.6

Discussion

My initial guesses were generally correct. For presorted data, (2) and (4) are several times faster when we have 100,000 elements. The difference becomes even bigger when we have many duplicates, as duplicates don't increase sorting time and inserting a duplicate into a HashSet is much more efficient than inserting one into a TreeSet. For random data the difference is less striking, as TreeSet performance depends much less on the input order than TimSort (the algorithm Java uses to sort lists and arrays). For small data sets the simple TreeSet is fast, but the (4) version can be competitive as well.

The source code of the benchmark along with the raw results is available here.

Tagir Valeev
  • Apparently, you took the liberty to modify `(1)`, as the question’s code doesn’t even compile. This is more than just a typo, as the question even refers to the non-existent “*maximum load*” of a `TreeSet`. So what you measured is not what the OP described. Besides that, the variants are incomparable, as they do entirely different things: some produce a `List`, some produce a `TreeSet`. Since these different results can dramatically influence the performance of the follow-up operations working with them, it’s not useful to compare their creation in isolation. – Holger Feb 15 '17 at 12:42
  • Regarding `(4)`, I’d populate the `HashSet` first and create the `ArrayList` in one go via `new ArrayList<>(set)` afterwards. This avoids capacity increase operations for the `ArrayList`. – Holger Feb 15 '17 at 12:45
  • @Holger, in (4) I wanted to demonstrate roughly how the stream works, not to suggest the most optimal solution. And indeed, the question has some unclear points. You may consider suggesting edits to the OP in the question comments. – Tagir Valeev Feb 15 '17 at 12:55
  • @TagirValeev , thanks a lot for all these details. I removed the custom constructor from `(1)` in the question, so your answer fits perfectly now. I tested various cases following your examples and managed to improve the code performance by about 40%, very nice. Cheers! – Guillaume F. Feb 15 '17 at 19:58
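Holger's tweak to (4) from the comments above can be sketched as follows (an illustrative sketch, not part of the measured benchmark; the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupThenCopy {
    // Variant of (4) per the comment: fill the HashSet first, then create
    // the ArrayList in one go via new ArrayList<>(set), so the list is
    // allocated at its final size and never has to grow during insertion.
    static <T extends Comparable<T>> List<T> distinctSorted(List<T> input) {
        Set<T> set = new HashSet<>(input);      // deduplicate in one pass
        List<T> result = new ArrayList<>(set);  // single exact-size allocation
        result.sort(null);                      // sort in place, natural order
        return result;
    }

    public static void main(String[] args) {
        System.out.println(distinctSorted(Arrays.asList(5, 1, 5, 2, 1)));  // [1, 2, 5]
    }
}
```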
11

It is hard to give a good answer without analyzing the input. Anyway, I'll share my results:

I made Foo a container for a single long, and MyObj a container for a single Foo. I also made all tests end with copying the data into a plain array, and added two approaches:

4) Simple array

@Benchmark
public void simpleArray(Blackhole bh) {
    MyObj[] ret = new MyObj[foos.size()];
    for (int i=0;i<foos.size();i++)
        ret[i] = new MyObj(foos.get(i));
    Arrays.sort(ret);
    int lastDistinct = 0;
    for (int i = 1; i < ret.length; i++) {
        if (ret[i].equals(ret[lastDistinct])) {
            continue;
        }
        lastDistinct++;
        ret[lastDistinct] = ret[i];
    }
    MyObj[] ret2 = new MyObj[lastDistinct + 1];
    System.arraycopy(ret, 0, ret2, 0, lastDistinct + 1);
    bh.consume(ret2);
}

5) The stream from (2) with the order of sorted and distinct reversed:

@Benchmark
public void simpleStream_distinctAfterSort(Blackhole bh) {
    List<MyObj> ret = foos.stream().map(MyObj::new)
            .sorted().distinct()
            .collect(Collectors.toList());
    bh.consume(ret.toArray(new MyObj[ret.size()]));
}

Tests setup:

public static final int MAX_SIZE = 10_000;
public static final long ELEM_THRESHOLD = MAX_SIZE * 10;
private List<Foo> foos;

@Setup
public void init() throws IOException, IllegalAccessException, InstantiationException {
    foos = new ArrayList<>(MAX_SIZE);
    for (int i = 0; i < MAX_SIZE; i++) {
        foos.add(new Foo(ThreadLocalRandom.current().nextLong(ELEM_THRESHOLD)));
    }
}

Now the results with different sizes and thresholds:

Size=10_000
Threshold=Size*10
Benchmark                                         Mode  Cnt    Score   Error  Units
StreamBenchmark5.enhancedLoop_TreeSet            thrpt    2  478.978          ops/s
StreamBenchmark5.simpleArray                     thrpt    2  591.287          ops/s
StreamBenchmark5.simpleStream                    thrpt    2  407.556          ops/s
StreamBenchmark5.simpleStream_distinctAfterSort  thrpt    2  533.091          ops/s
StreamBenchmark5.treeSetStream                   thrpt    2  492.709          ops/s

Size=10_000
Threshold=Size/10
StreamBenchmark5.enhancedLoop_TreeSet            thrpt    2   937.908          ops/s
StreamBenchmark5.simpleArray                     thrpt    2   593.983          ops/s
StreamBenchmark5.simpleStream                    thrpt    2  3344.508          ops/s
StreamBenchmark5.simpleStream_distinctAfterSort  thrpt    2   560.652          ops/s
StreamBenchmark5.treeSetStream                   thrpt    2  1000.585          ops/s

Size=500_000
Threshold=Size*10
Benchmark                                         Mode  Cnt  Score   Error  Units
StreamBenchmark5.enhancedLoop_TreeSet            thrpt    2  1.809          ops/s
StreamBenchmark5.simpleArray                     thrpt    2  4.009          ops/s
StreamBenchmark5.simpleStream                    thrpt    2  2.443          ops/s
StreamBenchmark5.simpleStream_distinctAfterSort  thrpt    2  4.141          ops/s
StreamBenchmark5.treeSetStream                   thrpt    2  2.040          ops/s

Size=500_000
Threshold=Size/10
Benchmark                                         Mode  Cnt   Score   Error  Units
StreamBenchmark5.enhancedLoop_TreeSet            thrpt    2   5.724          ops/s
StreamBenchmark5.simpleArray                     thrpt    2   4.567          ops/s
StreamBenchmark5.simpleStream                    thrpt    2  19.001          ops/s
StreamBenchmark5.simpleStream_distinctAfterSort  thrpt    2   4.840          ops/s
StreamBenchmark5.treeSetStream                   thrpt    2   5.407          ops/s

Size=1_000_000
Threshold=Size/100
Benchmark                                         Mode  Cnt   Score   Error  Units
StreamBenchmark5.enhancedLoop_TreeSet            thrpt    2   4.529          ops/s
StreamBenchmark5.simpleArray                     thrpt    2   2.402          ops/s
StreamBenchmark5.simpleStream                    thrpt    2  35.699          ops/s
StreamBenchmark5.simpleStream_distinctAfterSort  thrpt    2   2.232          ops/s
StreamBenchmark5.treeSetStream                   thrpt    2   4.889          ops/s

As you can see, the preferable algorithm changes depending on the number of duplicates. The most balanced approach is the TreeSet stream (3); however, the fastest one is almost always the simple stream, with sorted and distinct ordered to match the input data.
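Both orderings produce the same result; they differ only in cost. Putting distinct() first dedups with a HashSet and then sorts fewer elements, while putting sorted() first means distinct() sees a SORTED stream and can compare each element with the previous one instead of hashing. A small demonstration (class and method names are mine):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OrderOfOps {
    // distinct() before sorted(): dedup via HashSet, then sort fewer elements.
    static <T extends Comparable<T>> List<T> distinctThenSort(List<T> in) {
        return in.stream().distinct().sorted().collect(Collectors.toList());
    }

    // sorted() before distinct(): sort everything; distinct() then exploits
    // the SORTED characteristic and just compares adjacent elements.
    static <T extends Comparable<T>> List<T> sortThenDistinct(List<T> in) {
        return in.stream().sorted().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(4, 2, 4, 1, 2);
        System.out.println(distinctThenSort(input));  // [1, 2, 4]
        System.out.println(sortThenDistinct(input));  // [1, 2, 4]
    }
}
```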

Here's the test source if you are willing to play a bit. You'll need JMH.

user3707125