54

When processing large amounts of data I often find myself doing the following:

HashSet<String> set = new HashSet<String> ();
//Adding elements to the set
ArrayList<String> list = new ArrayList<String> (set);

Something like "dumping" the contents of the set in the list. I usually do this since the elements I add often contain duplicates I want to remove, and this seems like an easy way to remove them.

With only that objective in mind (avoiding duplicates) I could also write:

ArrayList<String> list = new ArrayList<String> ();
// Processing here
if (! list.contains(element)) list.add(element);
//More processing here

That way there's no need to "dump" the set into the list. However, I'd be doing a small check before inserting each element (which I'm assuming HashSet does as well).

Is either of the two possibilities clearly more efficient?

Jorge
  • 1,574
  • 2
  • 12
  • 14
  • You have your first part of the question wrong. You're dumping the list into the set to get rid of duplicates, not the other way around, right? – MirMasej Sep 13 '15 at 17:11
  • Why don't you test it? Btw why bother with converting the set into a list anyways? Going through set will most probably be faster for large arrays. – luk32 Sep 13 '15 at 17:11
  • Hi, thank you for your comments. In this scenario I populate my set with the data (to avoid duplicates) and then dump it into a list; that way I effectively get a List with no dupes. If I didn't need the list I wouldn't actually create one, but sometimes a sort is applied afterwards, and some of the code I work with requires lists. – Jorge Sep 13 '15 at 17:56

6 Answers

105

The set will give much better performance (O(n) vs O(n^2) for the list), and that's normal because set membership (the contains operation) is the very purpose of a set.

Contains for a HashSet is O(1), compared to O(n) for a list, so you should never use a list if you often need to run contains.
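
To make the complexity difference concrete, here is a minimal sketch of the two approaches from the question (the DedupSketch class and method names are placeholders of mine, not code from the question or this answer):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    // Roughly O(n) on average: each insertion is a constant-time hash lookup.
    static List<String> dedupWithSet(List<String> input) {
        Set<String> seen = new HashSet<>(input); // duplicates are silently dropped
        return new ArrayList<>(seen);            // "dump" the set into a list
    }

    // Roughly O(n^2) in the worst case: contains() rescans the result list
    // for every element of the input.
    static List<String> dedupWithContains(List<String> input) {
        List<String> result = new ArrayList<>();
        for (String s : input) {
            if (!result.contains(s)) {
                result.add(s);
            }
        }
        return result;
    }
}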

Dici
  • 25,226
  • 7
  • 41
  • 82
  • 10
    What if the list contains only a few elements? – Ivan Balashov Aug 28 '17 at 07:25
  • 8
    Complexity calculation doesn't really apply to bounded problems. Its goal is to understand how much slower the computation becomes as the problem size increases, becoming infinitely large. That said, I don't think there is ever an advantage to using a list over a hash set for the `contains` operation. Sure, a set has a larger memory overhead in general, but if you have only a few elements why would you even care? More efficient set implementations exist for bounded datasets (`EnumSet` for example), but generally a simple hash set should be more than enough for typical performance requirements – Dici Aug 28 '17 at 22:50
  • 6
    Often we already have an ephemeral list on which we need to run `.contains`. The question is, from what size does it make sense to create a Set? Under 10 elements both perform on the scale of 1-2 micros, but we spend time creating the Set. Anyway, here is a quick benchmark if somebody is interested https://gist.github.com/ibalashov/0138e850e58942569a636dffa75f0bb9 – Ivan Balashov Aug 30 '17 at 06:43
  • @Dici to be exact, it's *amortized* `O(1)`. This has little to do with duplicates though, `List::contains` will stop at the very first duplicate anyway; it's more about the `hashing` structure of `HashSet` here that gives that much of a boost – Eugene Sep 09 '18 at 21:15
  • @Eugene also, I'm well aware about the various ways of implementing a hash table, what I meant in this answer is that it's not surprising that set membership (that the OP is using here for avoiding duplicates) is more efficient for a `Set` because it is literally made for that. I guess the phrasing wasn't great though. – Dici Sep 09 '18 at 21:18
18

The ArrayList uses an array to store its data, so ArrayList.contains is O(n). Searching the array again and again while building the list therefore gives O(n^2) overall.

HashSet, on the other hand, uses a hashing mechanism to store elements in their respective buckets, so it reaches an element in O(1) on average. For a long list of values the HashSet's contains will be much faster.
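
As a rough illustration of the bucket idea, a simplified sketch (not the actual OpenJDK implementation, which additionally spreads the hash bits and resolves collisions with linked nodes or trees):

public class BucketSketch {
    // The element's hashCode picks a bucket; only that bucket is searched,
    // instead of scanning the whole backing array as ArrayList.contains does.
    static int bucketIndex(Object element, int tableLength) {
        // tableLength is assumed to be a power of two, as in HashMap
        return (tableLength - 1) & element.hashCode();
    }
}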

YoungHobbit
  • 13,254
  • 9
  • 50
  • 73
9

I ran a test, so please check the results.

For the SAME String items in a HashSet, TreeSet, ArrayList and LinkedList, here are the results for:

  1. 50.000 UUIDs
    • SEARCHED ITEM : e608c7d5-c861-4603-9134-8c636a05a42b (index 25.000)
    • hashSet.contains(item) ? TRUE 0 ms
    • treeSet.contains(item) ? TRUE 0 ms
    • arrayList.contains(item) ? TRUE 2 ms
    • linkedList.contains(item) ? TRUE 3 ms
  2. 5.000.000 UUIDs
    • SEARCHED ITEM : 61fb2592-3186-4256-a084-6c96f9322a86 (index 25.000)
    • hashSet.contains(item) ? TRUE 0 ms
    • treeSet.contains(item) ? TRUE 0 ms
    • arrayList.contains(item) ? TRUE 1 ms
    • linkedList.contains(item) ? TRUE 2 ms
  3. 5.000.000 UUIDs
    • SEARCHED ITEM : db568900-c874-46ba-9b44-0e1916420120 (index 2.500.000)
    • hashSet.contains(item) ? TRUE 0 ms
    • treeSet.contains(item) ? TRUE 0 ms
    • arrayList.contains(item) ? TRUE 33 ms
    • linkedList.contains(item) ? TRUE 65 ms

Based on the above results, there is NOT a BIG difference between using an ArrayList vs a Set. Perhaps you can try to modify this code, replace the String with your own Object, and see the differences then...

    public static void main(String[] args) {
        Set<String> hashSet = new HashSet<>();
        Set<String> treeSet = new TreeSet<>();
        List<String> arrayList = new ArrayList<>();
        List<String> linkedList = new LinkedList<>();

        List<String> base = new ArrayList<>();

        // Generate 5,000,000 random UUID strings as the test data
        for (int i = 0; i < 5000000; i++) {
            if (i % 100000 == 0) System.out.print(".");
            base.add(UUID.randomUUID().toString());
        }

        System.out.println("\nBase size : " + base.size());
        String item = base.get(25000);
        System.out.println("SEARCHED ITEM : " + item);

        hashSet.addAll(base);
        treeSet.addAll(base);
        arrayList.addAll(base);
        linkedList.addAll(base);

        // Reset the timer before each lookup so the measurements don't accumulate
        long ms = System.currentTimeMillis();
        System.out.println("hashSet.contains(item) ? " + (hashSet.contains(item) ? "TRUE " : "FALSE") + (System.currentTimeMillis() - ms) + " ms");
        ms = System.currentTimeMillis();
        System.out.println("treeSet.contains(item) ? " + (treeSet.contains(item) ? "TRUE " : "FALSE") + (System.currentTimeMillis() - ms) + " ms");
        ms = System.currentTimeMillis();
        System.out.println("arrayList.contains(item) ? " + (arrayList.contains(item) ? "TRUE " : "FALSE") + (System.currentTimeMillis() - ms) + " ms");
        ms = System.currentTimeMillis();
        System.out.println("linkedList.contains(item) ? " + (linkedList.contains(item) ? "TRUE " : "FALSE") + (System.currentTimeMillis() - ms) + " ms");
    }
urs86ro
  • 99
  • 1
  • 3
  • 6
    "Based on above results, there is NOT a BIG difference of using array list vs set". From your numbers, this is clearly not the case; for 5 million UUIDs, an ArrayList is at least 33x slower than either a TreeSet or a HashSet when the element is in the middle of the Collection. – Abhishek Divekar Apr 25 '18 at 05:59
  • 1
    This benchmark is too small to be conclusive, and your interpretation of what it does show is incorrect, as mentioned by abhi. – Dici Jul 25 '18 at 21:39
  • 2
    Classic assumptions about small time differences: 2-3ms doesn't sound like much. Now imagine that your code is in a tight loop iterating through 10,000 items, performing a 'contains' for each one. Those extra 2-3ms just caused an extra 20-30 second delay!!! I have been in situations where I have made incredible performance improvements by shaving 2-3ms off a particular operation in a client-facing app. You just have to pick your optimizations: no use saving 2ms on something called once every hour, but on something called thousands of times in a short period of time... hell yeah! – Volksman Jul 22 '19 at 21:33
  • If my math is right, based on your results, HashSet and TreeSet are infinitely faster than ArrayList and LinkedList in all your tests.: 2ms/0ms -> infinite. – cquezel Aug 31 '22 at 15:54
5

If you don't need a list, I would just use a Set; it's the natural collection to use when order doesn't matter and you want to ignore duplicates.

You can do both if you need a List without duplicates.

private Set<String> set = new HashSet<>();
private List<String> list = new ArrayList<>();


public void add(String str) {
    if (set.add(str))
        list.add(str);
}

This way the list will only contain unique values, the original insertion order is preserved and the operation is O(1).
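
For reference, a runnable sketch of the same idea; the UniqueList class name, the values() accessor and the main method are mine, not part of the answer:

import java.util.*;

public class UniqueList {
    private final Set<String> set = new HashSet<>();
    private final List<String> list = new ArrayList<>();

    public void add(String str) {
        if (set.add(str))      // Set.add returns false for duplicates
            list.add(str);
    }

    public List<String> values() {
        return Collections.unmodifiableList(list);
    }

    public static void main(String[] args) {
        UniqueList u = new UniqueList();
        for (String s : Arrays.asList("a", "b", "a", "c", "b"))
            u.add(s);
        System.out.println(u.values()); // prints [a, b, c]: insertion order kept, duplicates dropped
    }
}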

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
1

You could add elements to the list itself. Then, to dedup:

HashSet<String> hs = new HashSet<>(); // new hashset
hs.addAll(list); // add all list elements to hashset (this is the dedup, since addAll works as a union, thus removing all duplicates)
list.clear(); // clear the list
list.addAll(hs); // add all hashset elements to the list

If you just need a deduplicated set, you can also call addAll() on a separate set, so that it will contain only unique values.
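
For example, a minimal sketch of that variant (variable names are placeholders):

Set<String> unique = new HashSet<>();
unique.addAll(list);                          // duplicates from the list are dropped
// or, equivalently, in one step:
Set<String> uniqueCopy = new HashSet<>(list);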

Prateek Paranjpe
  • 513
  • 3
  • 13
1

I did a small, trivial test of the "contains" method with random strings on Java 17, using TreeSet, HashSet and ArrayList.

The break-even point is around 5 elements in the collection: with 4 or fewer elements, ArrayList is faster; with 6 or more, HashSet is faster.

Intuitively, I would have thought that the break-even value would be much higher and that TreeSet would outperform HashSet for smaller sizes.
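
For reference, a rough sketch of that kind of break-even test (my own code, not the answerer's; a proper harness such as JMH would give more trustworthy numbers):

import java.util.*;

public class BreakEvenSketch {
    public static void main(String[] args) {
        for (int size = 2; size <= 16; size *= 2) {
            List<String> list = new ArrayList<>();
            for (int i = 0; i < size; i++)
                list.add(UUID.randomUUID().toString());
            Set<String> set = new HashSet<>(list);
            String probe = list.get(size / 2);

            boolean sink = false; // keep the JIT from discarding the lookups

            long t = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) sink ^= list.contains(probe);
            long listNanos = System.nanoTime() - t;

            t = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) sink ^= set.contains(probe);
            long setNanos = System.nanoTime() - t;

            System.out.printf("size=%-3d list=%,d ns  set=%,d ns  (%b)%n",
                    size, listNanos, setNanos, sink);
        }
    }
}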

cquezel
  • 3,859
  • 1
  • 30
  • 32
  • It would also be interesting to know what the relationship is for INTEGERS instead of STRINGS, because .contains() in an ArrayList uses `equals()` whereas .contains() in a Map first uses the `hashCode()`, which is an integer. Fully comparing Strings is much slower than comparing integers, so a .contains() in an ArrayList of Integers could outperform any Map implementation for quite a few more than 5 entries, even if each number is unique. – Dreamspace President Jan 27 '23 at 13:21
  • 1
    @DreamspacePresident Also, the fact that the String class's hashCode() is lazily calculated certainly does not help my "ballpark" test. – cquezel Jan 27 '23 at 13:50