
I need to ensure that a certain Set<String> I create is not modified somewhere else in the code. Of course, I ended up using Guava's ImmutableSet for this.

This immutable set is quite large (approx. 59K Strings), and I have to perform a Set#contains check every time a particular method is called. So I was wondering whether there is any way to specify how look-ups are performed in such a big set. Guava's documentation says:

A high-performance, immutable Set with reliable, user-specified iteration order. Does not permit null elements.

What does "user-specified iteration order" mean if the immutable set is created by calling ImmutableSet#copyOf(aHashSet)? Will the performance of contains(String) be adversely affected if I use ImmutableSet#contains instead of HashSet#contains? To be more precise, my question is the following:

With a decent hash function and not too many elements landing in the same bucket, one would expect HashSet#contains to be O(1). Will an ImmutableSet created using copyOf adhere to this?

There are two reasons behind my suspicion that this might not be the case:

  1. Guava forum discussion on precisely this question (didn't seem to provide a conclusive answer though).

  2. It's not clear to me whether ImmutableSet#contains defers to java.util.Set#contains (i.e., the implementation in HashSet, in my case) or com.google.common.collect.ImmutableCollection#contains. If it is the latter, then ImmutableSet#contains will be an O(n) operation.

ColinD
Chthonic Project

2 Answers


The only confirmation I see in the documentation is the following:

this class's factory methods create hash-based instances, ...

In other words, you can expect lookups to use a hashing mechanism (and thus have performance characteristics) similar to HashSet. The docs are deliberately vague so that various improvements can be made (for example, using a special implementation for certain special cases, like a singleton or empty set).

The iteration order will depend on the method of creation. In the case of copyOf, it will be the iteration order of the Iterable you passed in (at the time the copy is made, of course). This is strongly documented:

Returns an immutable set containing the given elements, in order.

As to whether it defers to the set's contains method, no. Because ImmutableSet makes a copy (unlike Collections.unmodifiableSet()), it clearly cannot defer to the original set for any operations.
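The difference between a wrapper and a copy can be sketched with the JDK alone. This is only an illustration: `Set.copyOf` (Java 10+) stands in for Guava's `ImmutableSet.copyOf` here, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class ViewVersusCopy {

    /** Returns true if a later change to the source shows up in an unmodifiable view. */
    static boolean viewSeesLaterChange() {
        Set<String> source = new HashSet<>(Set.of("a", "b"));
        // A wrapper: every call, including contains, is forwarded to 'source'.
        Set<String> view = Collections.unmodifiableSet(source);
        source.add("c");
        return view.contains("c");
    }

    /** Returns true if a later change to the source shows up in an immutable copy. */
    static boolean copySeesLaterChange() {
        Set<String> source = new HashSet<>(Set.of("a", "b"));
        // A snapshot: it has its own storage, so it cannot delegate to 'source'.
        // (Stand-in for Guava's ImmutableSet.copyOf, which is an assumption of this sketch.)
        Set<String> copy = Set.copyOf(source);
        source.add("c");
        return copy.contains("c");
    }

    public static void main(String[] args) {
        System.out.println("view sees later change: " + viewSeesLaterChange());
        System.out.println("copy sees later change: " + copySeesLaterChange());
    }
}
```

Because the copy owns its elements, its `contains` has to be answered by the copy's own data structure rather than by the original `HashSet`.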

Mark Peters
  • Yes, the iteration will be in the order of the iterable that was passed. But my confusion is between that, and the [following statement](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/ImmutableCollection.html#contains(java.lang.Object)) "This implementation iterates over the elements in the collection, checking each element in turn for equality with the specified element." Does this mean an O(n) check for contains even if the original set was a HashSet, because ImmutableSet inherits the contains method from ImmutableCollection? – Chthonic Project Feb 08 '15 at 08:16
  • No, it doesn't. "this implementation" refers to that implementation only, i.e. `ImmutableCollection`. `ImmutableSet`s provide their own implementation of `contains`. – Mark Peters Feb 08 '15 at 17:43
  • Ah, I see. I was looking at [this doc](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/ImmutableSet.html), which made me think otherwise. – Chthonic Project Feb 08 '15 at 20:31
  • 1
    @ChthonicProject: I understand the confusion. It arises from the fact that while (the abstract) `ImmutableSet` does not override `contains` (and thus inherits the method/Javadoc from `ImmutableCollection`) all of its concrete implementations (which are encapsulated, and thus not documented) *do* override `contains`. If you feel like the documentation around `ImmutableSet` can be improved, I'm sure the Guava team would welcome suggestions for improvement. – Mark Peters Feb 09 '15 at 22:43
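The situation described in these comments can be reproduced with a small JDK-only sketch. All class names below are hypothetical and are not Guava's: an abstract base inherits the linear-scan `contains` from `java.util.AbstractCollection`, while a hidden concrete subclass overrides it with a hash lookup. A counter on `iterator()` shows which path a call actually takes.

```java
import java.util.AbstractCollection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class ContainsOverrideDemo {

    /** Hypothetical abstract base: its documented contains is the inherited linear scan. */
    abstract static class ScanningCollection extends AbstractCollection<String> {
        private final List<String> elements;
        int iteratorCalls = 0; // counts how often a scan was started

        ScanningCollection(List<String> elements) {
            this.elements = List.copyOf(elements);
        }

        @Override public Iterator<String> iterator() {
            iteratorCalls++;
            return elements.iterator();
        }

        @Override public int size() {
            return elements.size();
        }
    }

    /** Hypothetical concrete subclass: overrides contains with a hash-based lookup. */
    static final class HashedCollection extends ScanningCollection {
        private final Set<String> index;

        HashedCollection(List<String> elements) {
            super(elements);
            this.index = new HashSet<>(elements);
        }

        @Override public boolean contains(Object o) {
            return index.contains(o); // never touches iterator()
        }
    }

    static int scansUsedByInheritedContains() {
        ScanningCollection plain = new ScanningCollection(List.of("a", "b", "c")) {};
        plain.contains("c");
        return plain.iteratorCalls;
    }

    static int scansUsedByOverriddenContains() {
        ScanningCollection hashed = new HashedCollection(List.of("a", "b", "c"));
        hashed.contains("c");
        return hashed.iteratorCalls;
    }

    public static void main(String[] args) {
        System.out.println("inherited contains started scans: " + scansUsedByInheritedContains());
        System.out.println("overridden contains started scans: " + scansUsedByOverriddenContains());
    }
}
```

The Javadoc of the abstract type would describe the scanning behaviour, yet a caller holding the concrete instance gets the overriding method through ordinary dynamic dispatch.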

Just a small addition to Mark Peters' answer.

With RegularImmutableSet, the order is preserved by storing the elements twice (once in order, once hashed). This is still cheaper than the original HashSet, which delegates to a HashMap that creates an entry object for each element stored.
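To make the "stored twice" idea concrete, here is a deliberately simplified sketch. It is not Guava's code: the class name, the hash-mixing step, the table sizing and the linear-probing strategy are all assumptions chosen for brevity. One array keeps the elements in insertion order for iteration, and a second array places the same references by hash for `contains`, with no per-element entry objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Hypothetical, simplified sketch; not Guava's actual implementation. */
final class TwoArraySetSketch {
    private final Object[] ordered; // elements in insertion order, used for iteration
    private final Object[] table;   // the same references placed by hash, used for contains
    private final int mask;

    private TwoArraySetSketch(Object[] ordered) {
        this.ordered = ordered;
        // Power-of-two capacity of at least twice the element count (an assumed sizing rule),
        // so the table always has empty slots and probing terminates.
        int capacity = Integer.highestOneBit(Math.max(1, ordered.length)) << 2;
        this.table = new Object[capacity];
        this.mask = capacity - 1;
        for (Object element : ordered) {
            int i = spread(element.hashCode()) & mask;
            while (table[i] != null) {
                i = (i + 1) & mask; // linear probing: try the next slot
            }
            table[i] = element;
        }
    }

    /** Copies the source, dropping duplicates and rejecting nulls, keeping first-seen order. */
    static TwoArraySetSketch copyOf(Iterable<?> source) {
        Set<Object> unique = new LinkedHashSet<>();
        for (Object element : source) {
            unique.add(Objects.requireNonNull(element));
        }
        return new TwoArraySetSketch(unique.toArray());
    }

    // Simple bit mixing so that high bits influence the slot (an assumption of this sketch).
    private static int spread(int h) {
        return h ^ (h >>> 16);
    }

    boolean contains(Object o) {
        if (o == null) {
            return false;
        }
        for (int i = spread(o.hashCode()) & mask; ; i = (i + 1) & mask) {
            Object candidate = table[i];
            if (candidate == null) {
                return false; // reached an empty slot: not present
            }
            if (candidate.equals(o)) {
                return true;
            }
        }
    }

    /** Read-only view of the elements in their original order. */
    List<Object> inOrder() {
        return Collections.unmodifiableList(Arrays.asList(ordered));
    }

    public static void main(String[] args) {
        TwoArraySetSketch set = copyOf(List.of("b", "a", "c", "a"));
        System.out.println(set.inOrder());
        System.out.println(set.contains("a"));
        System.out.println(set.contains("z"));
    }
}
```

The memory cost is two references per element plus the spare table slots, whereas a `HashMap`-backed set allocates a separate node object for every element.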

There are optimized implementations such as SingletonImmutableSet and EmptyImmutableSet, and many others that get used when you start with an immutable collection or map.

Use the source if you want to know more (but depend on the documentation only).

The performance discussion you linked deals only with hash collisions. Normally, the performance is O(1); it degenerates only with a really bad hash function. This holds for all hashing data structures, but the effects differ: RegularImmutableSet has better data locality, while HashSet uses chaining and can handle collisions better.
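The effect of a bad hash function can be observed without timing anything, by counting `equals` calls during lookups. The sketch below uses only `java.util.HashSet`; the `Key` class and its switchable constant hash code are hypothetical and exist purely for the demonstration.

```java
import java.util.HashSet;
import java.util.Set;

final class CollisionCostDemo {
    static int equalsCalls = 0; // global counter, reset before the lookups

    /** Hypothetical key whose hash code is either well distributed or constant. */
    static final class Key {
        final int id;
        final boolean constantHash;

        Key(int id, boolean constantHash) {
            this.id = id;
            this.constantHash = constantHash;
        }

        @Override public int hashCode() {
            return constantHash ? 42 : id; // 42 for every key is the "really bad" case
        }

        @Override public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).id == id;
        }
    }

    /** Looks up every element once (via an equal but distinct instance) and counts equals calls. */
    static int equalsCallsForLookups(boolean constantHash, int n) {
        Set<Key> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new Key(i, constantHash));
        }
        equalsCalls = 0; // ignore the comparisons made while filling the set
        for (int i = 0; i < n; i++) {
            if (!set.contains(new Key(i, constantHash))) {
                throw new IllegalStateException("element " + i + " should be present");
            }
        }
        return equalsCalls;
    }

    public static void main(String[] args) {
        System.out.println("good hash, equals calls: " + equalsCallsForLookups(false, 1000));
        System.out.println("constant hash, equals calls: " + equalsCallsForLookups(true, 1000));
    }
}
```

With distinct hash codes each lookup needs very few comparisons, while the constant hash forces the set to compare against many colliding elements; how gracefully a given implementation degrades in that case is exactly where the data structures differ.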

There used to be a problem where certain kinds of conflicts led to an excessive number of collisions, but it was fixed a long time ago. Now it is practically impossible to run into something similar by accident.

maaartinus