47

Looking at the source of Java 6, HashSet<E> is actually implemented using HashMap<E,Object>, with a dummy object instance stored as the value of every entry in the Set.

I think that wastes 4 bytes (on 32-bit machines) in the size of each entry.

But, why is it still used? Is there any reason to use it besides making it easier to maintain the code?
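For reference, the structure in question looks roughly like this (an abridged sketch modeled on the JDK 6 source of java.util.HashSet; constructors and most methods are omitted):

    import java.util.AbstractSet;
    import java.util.HashMap;
    import java.util.Iterator;

    // Abridged sketch of JDK 6's java.util.HashSet: a Set that simply
    // delegates to a HashMap, storing a shared dummy object as every value.
    public class HashSet<E> extends AbstractSet<E> {
        private transient HashMap<E, Object> map = new HashMap<E, Object>();

        // Dummy value to associate with an Object in the backing Map.
        private static final Object PRESENT = new Object();

        public boolean add(E e) {
            // put returns null when the key was absent, i.e. the set changed
            return map.put(e, PRESENT) == null;
        }

        public boolean contains(Object o) {
            return map.containsKey(o);
        }

        public Iterator<E> iterator() {
            return map.keySet().iterator();
        }

        public int size() {
            return map.size();
        }
    }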

starball
Randy Sugianto 'Yuku'
  • @yuku: the level of waste in the default Java collections is mind-boggling. The worst offenders happen when you manipulate primitives. You think a HashSet is bad? No? Think about this: HashMap. If you're after efficient collections, you want to look at Trove (for primitives) or Javolution (real time). They both run circles around the default Java collections, performance- and memory-wise. We're doing heavy number crunching, and collections with millions of elements are common for us. Trove rocks. Javolution rocks. The default Java collections simply don't cut it. – SyntaxT3rr0r Feb 10 '10 at 09:25
  • @yuku: to continue on my comment... What I mean is: either performance and memory matter, and then you have to find an alternative because the level of waste in the default Java collections is way too high; or you don't need the performance and memory doesn't matter, because you'll be using a tiny number of elements, and then the default Java collections are OK (though there are probably better alternatives, like the Google collections, etc.) – SyntaxT3rr0r Feb 10 '10 at 09:27
  • @WizardOfOdds: that's a lot of bold statements with little evidence to back them up. – skaffman Feb 10 '10 at 09:51
  • HashMap does not allow duplicate keys, so using a HashMap to implement a set is a good idea. They might have wanted to reuse the existing code from HashMap. – fatma.ekici Jan 01 '13 at 21:15

7 Answers

22

Actually, it's not just HashSet. All implementations of the Set interface in Java 6 are based on an underlying Map. This is not a requirement; it's just the way the implementation is. You can see for yourself by checking out the documentation for the various implementations of Set.

Your main questions are:

But, why is it still used? Is there any reason to use it besides making it easier to maintain the code?

I assume that code maintenance is a big motivating factor. So is preventing duplication and bloat.

Set and Map are similar interfaces, in that duplicates (elements for a Set, keys for a Map) are not allowed. (I think the only Set not backed by a Map is CopyOnWriteArraySet, which is an unusual Collection because every mutation copies its backing array.)

Specifically:

From the documentation of Set:

A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)

The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).

And from Map:

An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value.

If you can implement your Sets using existing code, any benefit (speed, for example) you can realize from existing code accrues to your Set as well.

If you choose to implement a Set without a Map backing, you have to duplicate code designed to prevent duplicate elements. Ah, the delicious irony.

That said, there's nothing preventing you from implementing your Sets differently.

nmeln
JXG
  • "All implementations of the `Set` interface in Java 6 are based on an underlying `Collection`." (I assume you mean `Map` instead of `Collection`.) There exists at least one counter-example (other than subsets and the like): `EnumSet` is not based on a `Map`. – Tom Hawtin - tackline Jan 06 '13 at 20:15
  • There's one more possibility: it could've been implemented as Map<E,E> instead of Map<E,Object> and provide a get(T) for free, at least for HashSet (and possibly TreeSet), similar to what C++ offers. It'd probably lead to some hacky usages (I cannot come up with a legitimately clean one currently anyway), but now and then it can get stuff done. – Luke Jul 22 '17 at 18:48
5

My guess is that HashSet was originally implemented in terms of HashMap in order to get it done quickly and easily. In terms of lines of code, HashSet is a fraction of HashMap.

I would guess that the reason it still hasn't been optimized is fear of change.

However, the waste is much worse than you think. On both 32-bit and 64-bit, HashSet is 4x larger than necessary, and HashMap is 2x larger than necessary. HashMap could be implemented with an array with keys and values in it (plus chains for collisions). That means two pointers per entry, or 16 bytes on a 64-bit VM. In fact, HashMap contains an Entry object per entry, which adds 8 bytes for the pointer to the Entry and 8 bytes for the Entry object header. HashSet also uses 32 bytes per element, but the waste is 4x instead of 2x since it only requires 8 bytes per element.
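To make the flat-array layout concrete, here is a rough sketch of an open-addressing set with linear probing that stores each element as a single reference in one Object[]. This is only an illustration of the layout, not the JDK's or any library's actual implementation; it does not support null elements or removal:

    // Illustrative only: each element costs one array slot (one reference),
    // with no per-entry wrapper object. Collisions are resolved by linear
    // probing. Null elements and removal are not supported in this sketch.
    public class FlatHashSet {
        private Object[] slots = new Object[16]; // capacity stays a power of two
        private int size;

        public boolean add(Object e) {
            if (size * 2 >= slots.length) {
                grow(); // keep the load factor at or below 0.5
            }
            int i = indexFor(e, slots.length);
            while (slots[i] != null) {
                if (slots[i].equals(e)) {
                    return false; // already present
                }
                i = (i + 1) & (slots.length - 1); // probe the next slot
            }
            slots[i] = e;
            size++;
            return true;
        }

        public boolean contains(Object e) {
            int i = indexFor(e, slots.length);
            while (slots[i] != null) {
                if (slots[i].equals(e)) {
                    return true;
                }
                i = (i + 1) & (slots.length - 1);
            }
            return false;
        }

        private static int indexFor(Object e, int length) {
            return (e.hashCode() & 0x7fffffff) & (length - 1);
        }

        private void grow() {
            Object[] old = slots;
            slots = new Object[old.length * 2];
            size = 0;
            for (Object e : old) {
                if (e != null) {
                    add(e); // rehash into the larger table
                }
            }
        }
    }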

Craig P. Motlin
  • In the HotSpot JVM, an object header consists of two words, so a hash map entry with a pointer for the `key`, `value`, and a `next` entry for handling collisions has five times the space compared to a single reference in a flat array (when comparing with a possible `Set` implementation). But there’s still an array within `HashMap` too, the array of references to `Entry` instances. So in the end, the `HashMap` based `HashSet` takes roughly six times the space of a flat array based `HashSet`. On a 64 bit HotSpot JVM with CompressedOOPs and CompressedKlassPointers enabled, it’s even 6.5 times… – Holger Jul 07 '18 at 14:50
  • All the competitors, Eclipse Collections, Fastutils, Trove, etc. all achieve a 4x improvement. – Craig P. Motlin Jul 08 '18 at 23:01
  • That’s an empty statement without any mention of version numbers and particular JVM configuration. OpenJDK’s implementation has changed over time; most notably, the recent versions support a tree structure to handle collisions, which raises memory consumption even more when it happens. Further, my previous comment already explained that there is a dependency on the JVM architecture and configuration when it comes to the object overhead. Of course, alternative implementations have to resort to objects as well for collisions. The authors likely made an understatement to dodge such subtleties. – Holger Jul 09 '18 at 07:52
  • I am one of the authors. It’s not an understatement. 4x. All libraries, all versions. It’s been the same answer for 10+ years. – Craig P. Motlin Jul 09 '18 at 11:14
  • Well, in that case, you obviously ignored the fact that the JRE implementation of `HashMap` changed significantly during the last decade. And I don’t get why you insist so aggressively on rejecting the possibility that the improvement can be even better in certain scenarios. Is that “4x” a slogan handed down by some holy dictator that trumps every technical discussion, or what? – Holger Jul 09 '18 at 11:18
  • We have tests to measure memory usage. Java 8 had significant changes to implementation that had all sorts of impact, but not to memory usage, except in the edge case where all collisions go into the same bucket. – Craig P. Motlin Jul 09 '18 at 11:28
  • So only “in the edge case where all collisions go into the same bucket”, it makes a difference? But, 10% collisions or 90% collisions, a load factor of 0.1 or 0.9, 32 bit JVM or 64 bit JVM, one thousand elements, one million elements or one billion elements, that all doesn’t matter, it’s always “4x. All libraries, all versions.”? Well, then this only suggests that there is still room for improvement. – Holger Jul 09 '18 at 12:00
3

I am guessing that it has never turned up as a significant problem for real applications or important benchmarks. Why complicate the code for no real benefit?

Also note that object sizes are rounded up in many JVM implementations, so there may not actually be an increase in size (I don't know for this example). Also, the code for HashMap is likely to be already compiled and in cache. Other things being equal, more code => more cache misses => lower performance.
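If you'd rather measure than guess, a sketch along these lines works, assuming the JOL library (org.openjdk.jol:jol-core, which is not part of the JDK) is on the classpath; it reports the retained size of each object graph, alignment padding included:

    import java.util.HashMap;
    import java.util.HashSet;

    import org.openjdk.jol.info.GraphLayout;

    public class SizeCheck {
        public static void main(String[] args) {
            HashSet<Integer> set = new HashSet<Integer>();
            HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
            for (int i = 0; i < 1000; i++) {
                set.add(i);
                map.put(i, i);
            }
            // Total retained bytes of each structure, rounding/alignment included.
            System.out.println("set: " + GraphLayout.parseInstance(set).totalSize());
            System.out.println("map: " + GraphLayout.parseInstance(map).totalSize());
        }
    }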

Suraj Chandran
Tom Hawtin - tackline
3

Yes, you are right, a small amount of wastage is definitely there. Small because, for every entry, it uses the same object PRESENT (which is declared final). Hence the only wastage is the value slot of every entry in the HashMap.

Mostly, I think, they took this approach for maintainability and reusability. (The JCF developers would have thought: we have tested HashMap anyway, why not reuse it?)

But if you are dealing with huge collections, and you are a memory freak, then you may opt for better alternatives like Trove or Google Collections.

Suraj Chandran
  • Additional waste is having to store a reference to the key, which can be large if you have millions of entries in the set: 8 bytes * 1M objects = 8 MB of waste. – Yoni Roit Jul 21 '11 at 12:06
3

I looked at your question and it took me a while to think about what you said. So here's my opinion regarding the HashSet implementation.

It is necessary to have the dummy instance to know whether the value is present in the set.

Take a look at the add method:

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

And now let's take a look at the return value of put:

@return the previous value associated with key, or null if there was no mapping for key. (A null return can also indicate that the map previously associated null with key.)

So the PRESENT object is just used to represent that the set contains the e value. I think you asked why not use null instead of PRESENT. But then you would not be able to tell whether the entry was previously in the map, because map.put(key, value) would always return null and you would have no way to know whether the key already existed.
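A tiny demonstration of that ambiguity (a hypothetical example, not JDK code):

    import java.util.HashMap;
    import java.util.Map;

    public class NullValueAmbiguity {
        public static void main(String[] args) {
            Map<String, Object> map = new HashMap<String, Object>();
            // First put: "a" was absent, so put returns null.
            System.out.println(map.put("a", null)); // prints null
            // Second put: "a" IS present (mapped to null), yet put
            // still returns null, so the two cases are indistinguishable.
            System.out.println(map.put("a", null)); // prints null again
        }
    }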


That being said, you could argue that they could have used an implementation like this:

    public boolean add(E e) {
        if (map.containsKey(e)) {
            return false;
        }
        map.put(e, null);
        return true;
    }

I guess they accept wasting 4 bytes per entry to avoid computing the hashCode of the key twice (which could be expensive) whenever the key actually gets added.


If your question referred to why they used a HashMap that wastes 8 bytes per entry (because of the Map.Entry) instead of some other data structure using a similar entry of only 4, then yes, I would say they did it for the reasons you mentioned.

Lombo
0

After searching through pages like this, wondering why the standard implementation is mildly inefficient, I found com.carrotsearch.hppc.IntOpenHashSet.

clwhisk
-3

Your question: "I think that wastes 4 bytes (on 32-bit machines) in the size of each entry."

Just one Object instance is created for the entire HashSet data structure, and reusing it saves you from rewriting the whole HashMap-style code again.

    private static final Object PRESENT = new Object();

All the keys map to one and the same value, i.e. the PRESENT object.

Srujan Kumar Gulla