
I have implemented a method that simply loops over a set of CSV files containing data on a number of different modules, and adds each 'moduleName' to a HashSet (code shown below).

I have used a HashSet because it guarantees no duplicates are inserted, unlike an ArrayList, which would have to call the contains() method and iterate through the list to check whether the element is already there.

I believe using a HashSet gives better performance than an ArrayList here. Am I correct in stating that?

Also, can somebody explain to me:

  1. How do I work out the performance of each data structure if it is used?
  2. What is the complexity of each, in big-O notation?

    HashSet<String> modulesUploaded = new HashSet<String>();

    for (File f : marksheetFiles) {
        try {
            csvFileReader = new CSVFileReader(f);
            csvReader = csvFileReader.readFile();
            csvReader.readHeaders();

            while (csvReader.readRecord()) {
                String moduleName = csvReader.get("Module");

                // HashSet.add() ignores duplicates, so no contains() check is needed
                if (!moduleName.isEmpty()) {
                    modulesUploaded.add(moduleName);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // close the reader even if reading failed part-way through
            if (csvReader != null) {
                csvReader.close();
            }
        }
    }
    return modulesUploaded;

user1339335
  • You probably want to include the language you're using as one of the tags (you'll have to eliminate one of the others, but the language is almost undoubtedly more important). – Jerry Coffin Apr 17 '12 at 17:54

4 Answers


My experiment shows that a HashSet is faster than an ArrayList starting at collections of 3 elements (inclusive).

The complete results table:

| Boost | Collection Size |
|-------|-----------------|
| 2x    | 3 elements      |
| 3x    | 10 elements     |
| 6x    | 50 elements     |
| 12x   | 200 elements    |
| 532x  | 10,000 elements |

Note the proportion: the boost grows from 12x at 200 elements to 532x at 10,000 elements, which shows the linear lookup growth of the ArrayList.
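
For reference, a rough sketch of how such a comparison could be reproduced (this is not the original benchmark; the collection sizes, iteration counts, and use of System.nanoTime() are illustrative, and a naive loop like this ignores JIT warm-up):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class LookupComparison {
        public static void main(String[] args) {
            for (int size : new int[]{3, 10, 50, 200, 10_000}) {
                List<String> list = new ArrayList<>();
                Set<String> set = new HashSet<>();
                for (int i = 0; i < size; i++) {
                    list.add("module" + i);
                    set.add("module" + i);
                }
                String target = "module" + (size - 1); // worst case for the list

                // accumulate the results so the lookups are not optimised away
                boolean sink = false;

                long start = System.nanoTime();
                for (int i = 0; i < 100_000; i++) sink |= list.contains(target);
                long listNanos = System.nanoTime() - start;

                start = System.nanoTime();
                for (int i = 0; i < 100_000; i++) sink |= set.contains(target);
                long setNanos = System.nanoTime() - start;

                System.out.printf("size=%6d  list=%10d ns  set=%10d ns  (%b)%n",
                        size, listNanos, setNanos, sink);
            }
        }
    }
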
Andrey Chaschev

They're completely different classes, so the question is: what kind of behaviour do you want?

A HashSet ensures there are no duplicates and gives you an O(1) contains() method, but it doesn't preserve order.
An ArrayList doesn't prevent duplicates, and its contains() is O(n), but you can control the order of the entries.
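
A tiny, self-contained illustration of that difference (the module names are made up for the example):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class BehaviourDemo {
        public static void main(String[] args) {
            List<String> list = new ArrayList<>();
            Set<String> set = new HashSet<>();

            for (String module : new String[]{"Maths", "Physics", "Maths"}) {
                list.add(module); // duplicates kept, insertion order preserved
                set.add(module);  // duplicate "Maths" silently ignored, no defined order
            }

            System.out.println(list); // [Maths, Physics, Maths]
            System.out.println(set);  // e.g. [Physics, Maths] -- iteration order not guaranteed
        }
    }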

biziclop

> I believe using a HashSet gives better performance than an ArrayList here. Am I correct in stating that?

With many (whatever that means) entries, yes. With small data sizes, a raw linear search can be faster than hashing, though. Where exactly the break-even point lies, you simply have to measure. My gut feeling is that with fewer than 10 elements, linear look-up is probably faster; with more than 100 elements, hashing is probably faster; but that's just my feeling...

Lookup from a HashSet is constant time, O(1), provided that the hashCode implementation of the elements is sane. Linear look-up from a list is linear time, O(n).
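
As a sketch of what a "sane" implementation means, here is a hypothetical element class (the name and field are invented for the example) whose equals() and hashCode() are consistent, which is what keeps HashSet lookups close to O(1):

    import java.util.Objects;

    // Hypothetical element type: equal objects produce equal hash codes,
    // so a HashSet can spread elements across buckets and look them up in ~O(1).
    public final class ModuleKey {
        private final String name;

        public ModuleKey(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ModuleKey)) return false;
            return name.equals(((ModuleKey) o).name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name);
        }
    }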

Joonas Pulakka

It depends upon the usage of the data structure.

You are storing the data in a HashSet, and for your case a HashSet is better than an ArrayList for storage (as you do not want duplicate entries). But just storing is not usually the only intent.

It depends on how you wish to read and process the stored data. If you want sequential access or index-based random access, then an ArrayList is better; if ordering does not matter, then a HashSet is better.

If ordering matters but you also want to do a lot of modifications (additions and deletions), then a LinkedList is better.

For accessing a particular element, a HashSet has O(1) time complexity, whereas with an ArrayList it would be O(n): as you pointed out yourself, you would have to iterate through the list to see whether the element is present.
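
A short sketch of the two access patterns side by side (the module codes are made up for illustration):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class AccessPatternDemo {
        public static void main(String[] args) {
            List<String> modulesInOrder = new ArrayList<>();
            Set<String> uniqueModules = new HashSet<>();

            for (String m : new String[]{"CS101", "CS102", "CS103"}) {
                modulesInOrder.add(m);
                uniqueModules.add(m);
            }

            // ArrayList: cheap index-based access, insertion order preserved
            String second = modulesInOrder.get(1);            // "CS102"

            // HashSet: cheap membership test, but no positional access
            boolean known = uniqueModules.contains("CS103");  // true

            System.out.println(second + " " + known);
        }
    }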

nits.kk