Should a HashSet be allowed to be added to itself in Java?

Question

According to the contract for a Set in Java, "it is not permissible for a set to contain itself as an element" (source). However, this is possible in the case of a HashSet of Objects, as demonstrated here:

Set<Object> mySet = new HashSet<>();
mySet.add(mySet);
assertThat(mySet.size(), equalTo(1));

This assertion passes, but I would expect either the resulting set to have size 0 or an Exception to be thrown. I realize that HashSet is backed by a HashMap, but it seems like there should be an equality check before adding an element to avoid violating that contract, no?

This will fail when you try to calculate the hash code of the set itself, because that will become an infinitely recursive call — Kon, Apr 19 '18 at 15:43
Please quote the full doc: "*Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A **special case** of this prohibition is that it is not permissible for a set to contain itself as an element.*" The problem is mutability. Just checking for `==` equality only takes care of a small fraction of not allowed cases. — Turing85, Apr 19 '18 at 15:43
The prohibition to not allow a set to contain itself is directed at the programmer, not at the class. It says "you, programmer, don't do that", not "you, class, don't allow that". — DwB, Apr 19 '18 at 15:48
@DwB: It's easy to see where the line blurs with Java, though, since Java is very largely restrictive in a large swath of internal functionality. — Makoto, Apr 19 '18 at 15:52
Related [math.se] question: [Why cannot a set be its own element?](https://math.stackexchange.com/questions/502259/why-cannot-a-set-be-its-own-element) Also, the [Barber Paradox](https://en.wikipedia.org/wiki/Barber_paradox) is very relevant here. — EJoshuaS - Stand with Ukraine, Apr 19 '18 at 15:54
@PatrickParker The OP didn't state that he actually *wants* to do that (the fact that he knew to ask this question kind of implies that he realizes that it's a bad idea to do that) - he's just asking why the code he shows actually works instead of throwing an exception or something like that. — EJoshuaS - Stand with Ukraine, Apr 19 '18 at 16:18
found similar question here : https://codegolf.stackexchange.com/a/21254 — dkb, Apr 19 '18 at 17:10
It is absolutely opinion-based. I think that undefined behaviors and unexpected runtime results should be avoided at all costs by throwing exceptions or even better, compilation errors, so that developers can find the bug and fix it in advance. — fps, Apr 19 '18 at 18:07
From the full doc above, the problem is handling mutable objects, and specifically depending on the implementation of equals() and hashCode(). So, this would probably be fine if you create your own implementation of a set that overrides equals() and hashCode() so that they don't rely on the mutable state. — Xantix, Apr 19 '18 at 20:13
@FedericoPeraltaSchaffner I don't think that this is POB - I think that answers can be reasonably backed up with facts and references. Something only becomes **primarily** opinion-based when you can no longer reasonably defend answers. I think that this question falls under the [constructive subjective questions guideline](https://stackoverflow.blog/2010/09/29/good-subjective-bad-subjective/). — EJoshuaS - Stand with Ukraine, Apr 19 '18 at 20:54
@EJoshuaS Well, the community seems to agree with you, and I'm fine with it :) But ask yourself OP's question: `Should a HashSet be allowed to be added to itself in Java?` My answer is an emphatic **NO**, because *I think* (and here's the opinion) that errors should occur as soon as possible during the development lifecycle. You also think that it shouldn't be allowed, but because of set's mathematical definition. And Makoto thinks it's OK as it is (his is the most upvoted answer). We all have our reasons backed by (solid?) arguments. — fps, Apr 19 '18 at 21:02
@FedericoPeraltaSchaffner True - conflicting answers backed by good arguments is actually a lot more common on some of the more "subjective" sites (e.g. [literature.se] and [scifi.se]) than it is here, but I don't think that it's intrinsically "bad." Each site has to decide exactly what their threshold of subjectivity is, but I think that some subjectivity can still lead to constructive Q&A as long as everyone is appealing to facts, reasonable arguments, professional experience, references, etc. (rather than just giving unsubstantiated opinions). — EJoshuaS - Stand with Ukraine, Apr 19 '18 at 21:13
@FedericoPeraltaSchaffner Arguing about how the program *should* behave may be opinion-based. (Although I agree that errors should usually show up as early as possible, for a lot of reasons: Even though the preferences here may be opinion-based, the pros and cons can still be stated objectively, to some extent). But for me, the (deeper) "core" of **this** question was why there is *no* simple equality check in the `add` method. And I tried to explain in my answer why this is not the case - mainly, because it could only prevent inconsistencies in the most trivial cases. — Marco13, Apr 19 '18 at 21:21

Marco13 · Accepted Answer · 2018-04-19T22:15:08.570

Others have already pointed out why it is questionable from a mathematical point of view, by referring to Russell's paradox.

This does not answer your question on a technical level, though.

So let's dissect this:

First, once more the relevant part from the JavaDoc of the Set interface:

Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.

Interestingly, the JavaDoc of the List interface makes a similar, although somewhat weaker, and at the same time more technical statement:

While it is permissible for lists to contain themselves as elements, extreme caution is advised: the equals and hashCode methods are no longer well defined on such a list.
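
To make the List caveat concrete, here is a minimal sketch. The class and method names (`SelfContainingListDemo`, `hashCodeOverflows`) are made up for illustration, and the observed behavior assumes a typical OpenJDK `ArrayList`; the JavaDoc itself only says that `equals` and `hashCode` are "no longer well defined".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
class SelfContainingListDemo {

    /** Returns true if hashCode() on a list that contains itself overflows the stack. */
    static boolean hashCodeOverflows() {
        List<Object> list = new ArrayList<>();
        list.add(list); // permitted: ArrayList.add does not call hashCode() or equals()
        try {
            list.hashCode(); // the list's hash depends on the hash of its element, which is the list
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("hashCode() overflowed: " + hashCodeOverflows());
    }
}
```

Adding the list to itself succeeds because nothing needs to be hashed on insertion; the trouble only starts once a recursive operation such as `hashCode()` is invoked.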

And finally, the crux is in the JavaDoc of the Collection interface, which is the common ancestor of both the Set and the List interface:

Some collection operations which perform recursive traversal of the collection may fail with an exception for self-referential instances where the collection **directly or indirectly contains itself**. This includes the clone(), equals(), hashCode() and toString() methods. Implementations may optionally handle the self-referential scenario, however most current implementations do not do so.

(Emphasis by me)

The bold part is a hint at why the approach that you proposed in your question would not be sufficient:

it seems like there should be an equality check before adding an element to avoid violating that contract, no?

This would not help you here. The key point is that you'll always run into problems when the collection will directly or indirectly contain itself. Imagine this scenario:

Set<Object> setA = new HashSet<Object>();
Set<Object> setB = new HashSet<Object>();
setA.add(setB);
setB.add(setA);

Obviously, neither of the sets contains itself directly. But each of them contains the other - and therefore, itself indirectly. This could not be avoided by a simple referential equality check (using == in the add method).
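
The two-set scenario can be run as a small experiment. This is only a sketch: the names `MutualContainmentDemo`, `buildCycle`, `directlyContainsItself` and `hashCodeOverflows` are invented here, and `directlyContainsItself` stands in for the kind of identity check that an `add` method could plausibly perform.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of the original answer.
class MutualContainmentDemo {

    /** Builds two sets that contain each other and returns the first one. */
    static Set<Object> buildCycle() {
        Set<Object> setA = new HashSet<>();
        Set<Object> setB = new HashSet<>();
        setA.add(setB); // setB is empty at this point, so its hash code is 0
        setB.add(setA); // setA only contains the still-empty setB, so this is fine too
        return setA;
    }

    /** Identity check over the direct elements only (no hashCode() or equals() involved). */
    static boolean directlyContainsItself(Set<Object> set) {
        for (Object element : set) {
            if (element == set) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if computing the hash code overflows the stack. */
    static boolean hashCodeOverflows(Set<Object> set) {
        try {
            set.hashCode();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Set<Object> setA = buildCycle();
        System.out.println("direct self-containment: " + directlyContainsItself(setA));
        System.out.println("hashCode() overflowed:   " + hashCodeOverflows(setA));
    }
}
```

Both `add` calls succeed and the identity check finds nothing, yet `hashCode()` still recurses from one set into the other and back.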

Avoiding such an "inconsistent state" is basically impossible in practice. Of course it is possible in theory, using reachability computations over the references. In fact, the garbage collector basically has to do exactly that!
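
As a rough sketch of what such a reachability computation might look like, the following walks the elements of nested collections and tracks visited objects by identity. Everything here (`ReachabilityCheck`, `reaches`, `containsItself`, `mutualSets`) is a hypothetical helper, not anything `HashSet` actually does, and it deliberately only follows objects that implement `Collection`; anything else is treated as opaque.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper, not part of the JDK or the original answer.
class ReachabilityCheck {

    /** Returns true if 'target' is reachable from 'start' by following Collection elements. */
    static boolean reaches(Object start, Object target, Set<Object> visited) {
        if (!(start instanceof Collection)) {
            return false; // not a collection: we cannot look inside it
        }
        if (!visited.add(start)) {
            return false; // already explored this collection
        }
        for (Object element : (Collection<?>) start) {
            if (element == target || reaches(element, target, visited)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the collection contains itself, directly or through nested collections. */
    static boolean containsItself(Collection<?> c) {
        // Identity-based set, so that tracking visited objects never calls their hashCode()
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        return reaches(c, c, visited);
    }

    /** Two sets that contain each other, for demonstration. */
    static Set<Object> mutualSets() {
        Set<Object> setA = new HashSet<>();
        Set<Object> setB = new HashSet<>();
        setA.add(setB);
        setB.add(setA);
        return setA;
    }

    public static void main(String[] args) {
        System.out.println(containsItself(mutualSets()));
    }
}
```

Running a traversal like this on every `add` would cost time proportional to everything reachable from the set, which is one reason it is not a realistic default.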

But it becomes impossible in practice when custom classes are involved. Imagine a class like this:

class Container {

    Set<Object> set;

    // hashCode() must be public to override Object.hashCode()
    @Override
    public int hashCode() {
        return set.hashCode();
    }
}

And messing around with this and its set:

Set<Object> set = new HashSet<Object>();
Container container = new Container();
container.set = set;
set.add(container);

The add method of the Set basically has no way of detecting whether the object that is added there has some (indirect) reference to the set itself.
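
A small, self-contained version of this experiment is sketched below. The wrapper class `ContainerCycleDemo` and the method `addSucceedsButHashCodeOverflows` are names chosen for this example; the nested `Container` mirrors the class above, with `hashCode()` declared public so that it compiles.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of the original answer.
class ContainerCycleDemo {

    /** Same shape as the Container above: its hash code is delegated to the set it holds. */
    static class Container {
        Set<Object> set;

        @Override
        public int hashCode() {
            return set.hashCode();
        }
    }

    /** Returns true if add() succeeds and a later hashCode() call overflows the stack. */
    static boolean addSucceedsButHashCodeOverflows() {
        Set<Object> set = new HashSet<>();
        Container container = new Container();
        container.set = set;
        // The set is still empty while the container is being hashed, so add() sees nothing odd
        boolean added = set.add(container);
        try {
            set.hashCode(); // set -> container -> set -> ...
            return false;
        } catch (StackOverflowError e) {
            return added;
        }
    }

    public static void main(String[] args) {
        System.out.println(addSucceedsButHashCodeOverflows());
    }
}
```

From the point of view of `add`, the container is just an object with a hash code; the cycle lives inside a field that the set has no generic way to inspect.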

Long story short:

You cannot prevent the programmer from messing things up.

Good explanation, thanks (and thanks to everyone else who joined this discussion). I hadn't considered the case where a set indirectly contains itself, which makes a check for this much less trivial. — davidmerrick, Apr 19 '18 at 22:27
And adding a check _just_ for the direct case feels "wrong" because it (slightly) slows down all correct code and _most_ programmers should be capable of avoiding an accidental, direct set-in-a-set case (once they know about it). — TripeHound, Apr 20 '18 at 10:04

score 23 · Answer 2 · answered Apr 19 '18 at 15:48

Adding the collection into itself once causes the test to pass. Adding it twice causes the StackOverflowError which you were seeking.

From a personal developer standpoint, it doesn't make any sense to enforce a check in the underlying code to prevent this. The fact that you get a StackOverflowError in your code if you attempt to do this too many times, or calculate the hashCode - which would cause an instant overflow - should be enough to ensure that no sane developer would keep this kind of code in their code base.
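
The difference between the first and the second `add` can be reproduced with a sketch like the following. `AddTwiceDemo` and `secondAddOverflows` are names made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of the original answer.
class AddTwiceDemo {

    /** Returns true if adding the set to itself a second time overflows the stack. */
    static boolean secondAddOverflows() {
        Set<Object> mySet = new HashSet<>();
        mySet.add(mySet); // first add: the set is still empty while it is hashed, so the hash is 0
        try {
            mySet.add(mySet); // second add: hashing the set now visits the set itself
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondAddOverflows());
    }
}
```

The first call works only because the hash code is computed before the element is stored; from then on, every operation that hashes the set recurses into itself.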

Makoto

At the same time, if the class is supposed to be implementing a set, I feel like it really should conform to the basic rules of set theory. Modern set theory definitely does *not* permit sets to be members of themselves, so the fact that this is permitted really seems like a flaw in the implementation to me. – EJoshuaS - Stand with Ukraine Apr 19 '18 at 17:24
@EJoshuaS `java.util.Set` is not a mathematical set - for example, it may change over time, must be finite, etc. Mathematically, in ZFC a set can't be a member of itself due to the well-foundedness requirement but there are other formulations that allow it. https://en.wikipedia.org/wiki/Non-well-founded_set_theory – Reinstate Monica Apr 19 '18 at 17:28
@EJoshuaS: If you actually try to write code that somehow makes use of this set as a member of itself, you'll run into the runtime error. I don't see how that *doesn't* conform to set theory in that you get a state which is *technically* undefined when doing invalid operations on it. – Makoto Apr 19 '18 at 17:30
And if you enforce that a set cannot contain itself in `add`, you'd just turn one runtime error into a different runtime error. Fail-fast behavior for runtime errors is often desirable, but in this case you'd be adding a check to the common path for a very uncommon mistake that's going to blow up soon anyway. The original developer may have decided that wasn't worth it. – StackOverthrow Apr 19 '18 at 18:03
@Solomonoff'sSecret Why wouldn't you stick as closely as possible to ZFC though? Obviously the sets have to be finite (given that computers can only have a finite amount of memory), but it seems more intuitive to stick w/ ZFC if possible. You bring up a good point about the mutability issue. – EJoshuaS - Stand with Ukraine Apr 19 '18 at 19:24
@EJoshuaS Because it's a mainstream programming language's standard library. Well-foundedness, regularity, and all that nonsense has nothing to do with it. `java.util.Set` doesn't let a set contain itself for entirely practical reasons (for example, how do you compute the hash code of such a set?), not due to an obscure mathematical theory. – Reinstate Monica Apr 19 '18 at 19:47
@Solomonoff'sSecret Isn't the "turtles all the way down" problem with computing hash codes just another example of the problems with naive set theory, though? It seems like the practical issues with this just reflect the reasons that they had to replace naive set theory with ZFC in the first place. – EJoshuaS - Stand with Ukraine Apr 19 '18 at 20:07
@Solomonoff'sSecret It works if the hash function for sets is not very good - I've seen an implementation (not Java) where the hash of a set was just the number of its elements. Bugger. That means the hashcode changes while you insert the set into the set itself. You are right. Unless the hash is just a constant. – gnasher729 Apr 19 '18 at 21:55
@EJoshuaS I don't know if that is exactly the reason. The way I see it, regularity requires all sets to be able to be formed "from the bottom up", which avoids a Russell's Paradox type situation. And here is an algorithm to compute the hash code of a hierarchy of sets. Start at the top and work down recursively. Whenever you encounter a set contained in an ancestor, instead of computing its hash code, substitute 0; otherwise calculate it like Java's Set does. But the problem is that a Set could contain a List which contains an AtomicReference which contains the Set, so it isn't a general solution. – Reinstate Monica Apr 19 '18 at 23:49
@EJoshuaS: Re: "Obviously the sets have to be finite (given that computers can only have a finite amount of memory)": The finite memory just means that the *representation* of the set has to be finite. If you wanted, you could design a class (or interface + implementing classes) that could represent things like "all integers", "all elements of *S* satisfying predicate *p*", "the union of *S* and *T*", and so on . . . not a full model of ZFC, but a potentially-useful approximation with a `contains` method. But that's not the purpose of `java.util.Set`! – ruakh Apr 20 '18 at 00:20
@ruakh I'd almost prefer to just use the notion of a lawlike choice sequence at that point - it's always seemed to be a better "fit" for that type of scenario to me given the limitation in the number of physical bits in the computer. I could be wrong though. – EJoshuaS - Stand with Ukraine Apr 20 '18 at 03:45
@EJoshuaS You might also notice that Java sets are mutable (usually) and mathematical sets are not. – user253751 Apr 20 '18 at 10:18
@ruakh Unfortunately a Goedel-like argument shows that if those predicates are expressive enough to encode Turing machines (which they almost certainly are in that context), even equality of those sets is undecidable: https://cs.stackexchange.com/questions/11916/show-that-it-is-undecidable-if-two-turing-machines-accept-the-same-language – Reinstate Monica Apr 20 '18 at 12:22

score 13 · Answer 3 · edited Apr 19 '18 at 19:07

You need to read the full doc and quote it fully:

The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.

The actual restriction is in the first sentence. The behavior is unspecified if an element of a set is mutated.

Since adding a set to itself mutates it, and adding it again mutates it again, the result is unspecified.

Note that the restriction is that the behavior is unspecified, and that a special case of that restriction is adding the set to itself.

So the doc says, in other words, that adding a set to itself results in unspecified behavior, which is what you are seeing. It's up to the concrete implementation to deal with it (or not).

EJoshuaS - Stand with Ukraine · Answer 4 · 2018-04-20T15:20:59.263

I agree with you that, from a mathematical perspective, this behavior really doesn't make sense.

There are two interesting questions here. First, to what extent were the designers of the Set interface trying to implement a mathematical set? Second, even if they weren't, to what extent does that exempt them from the rules of set theory?

For the first question, I will point you to the documentation of the Set:

A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.

It's worth mentioning here that current formulations of set theory don't permit sets to be members of themselves. (See the Axiom of regularity). This is due in part to Russell's Paradox, which exposed a contradiction in naive set theory (which permitted a set to be any collection of objects - there was no prohibition against sets including themselves). This is often illustrated by the Barber Paradox: suppose that, in a particular town, a barber shaves all of the men - and only the men - who do not shave themselves. Question: does the barber shave himself? If he does, it violates the second constraint; if he doesn't, it violates the first constraint. This is clearly logically impossible, but it's actually perfectly permissible under the rules of naive set theory (which is why the newer "standard" formulation of set theory explicitly bans sets from containing themselves).

There's more discussion in this question on Math.SE about why sets cannot be an element of themselves.

With that said, this brings up the second question: even if the designers hadn't been explicitly trying to model a mathematical set, would this be completely "exempt" from the problems associated with naive set theory? I think not - I think that many of the problems that plagued naive set theory would plague any kind of a collection that was insufficiently constrained in ways that were analogous to naive set theory. Indeed, I may be reading too much into this, but the first part of the definition of a Set in the documentation sounds suspiciously like the intuitive concept of a set in naive set theory:

A collection that contains no duplicate elements.

Admittedly (and to their credit), they do place at least some constraints on this later (including stating that you really shouldn't try to have a Set contain itself), but you could question whether it's really "enough" to avoid the problems with naive set theory. This is why, for example, you have a "turtles all the way down" problem when trying to calculate the hash code of a HashSet that contains itself. This is not, as some others have suggested, merely a practical problem - it's an illustration of the fundamental theoretical problems with this type of formulation.

As a brief digression, I do recognize that there are, of course, some limitations on how closely any collection class can really model a mathematical set. For example, Java's documentation warns against the dangers of including mutable objects in a set. Some other languages, such as Python, at least attempt to ban many kinds of mutable objects entirely:

The set classes are implemented using dictionaries. Accordingly, the requirements for set elements are the same as those for dictionary keys; namely, that the element defines both __eq__() and __hash__(). As a result, sets cannot contain mutable elements such as lists or dictionaries. However, they can contain immutable collections such as tuples or instances of ImmutableSet. For convenience in implementing sets of sets, inner sets are automatically converted to immutable form, for example, Set([Set(['dog'])]) is transformed to Set([ImmutableSet(['dog'])]).

Two other major differences that others have pointed out are

Java sets are mutable
Java sets are finite. Obviously, this will be true of any collection class: apart from concerns about actual infinity, computers only have a finite amount of memory. (Some languages, like Haskell, have lazy infinite data structures; however, in my opinion, a lawlike choice sequence seems like a more natural way to model these than classical set theory, but that's just my opinion).

TL;DR No, it really shouldn't be permitted (or, at least, you should never do that) because sets can't be members of themselves.

Regularity has absolutely nothing to do with the issue here; the problem is with *unrestricted comprehension* -- in programming terms, that is the axiom that every boolean function defines a set. — , Apr 19 '18 at 20:39
AFAIK, the *actual* reason to ask for regularity is so that you can make (transfinite) inductive arguments or recursive definitions over the ∈ relation. — , Apr 19 '18 at 20:40
I should clarify that the theory of computation avoids the problem with unrestricted comprehension in a completely different way than the typical approaches to set theory -- it relaxes what it means to be a "boolean function" to allow a third alternative for the result. (the function never halts) — , Apr 19 '18 at 20:46
@Hurkyl I was referring to the axiom of regularity mostly to back up my claim that ZFC doesn't permit sets to be an element of themselves. My point is that if you allow sets to be members of themselves you run into all kinds of problems (e.g. the "turtles all the way down" problem if you try to calculate the hash code, as Makoto referred to in his answer). These problems are examples of why naive set theory is terrible, though. — EJoshuaS - Stand with Ukraine, Apr 19 '18 at 20:52
Please note the comment on your linked question: "Russell's paradox does not prevent a set from being its own element". Moreover I'd argue that, although there is a deliberate similarity between mathematical sets and Java sets, there are too many differences for this to be a useful guide. eg you want to compute hash codes for the latter, which already rules out infinite sets allowed by the former. — stewbasic, Apr 20 '18 at 04:39
There are good arguments to be made against (programming) sets containing themselves but "because (mathematical) sets can't be members of themselves" is not one of them. — Chris H, Apr 20 '18 at 07:21
If you were using a math-library you *might* have a point, but a set as in one type of collection as used in programming languages is clearly distinct from a mathematical set. It bears *some* resemblance, but that is all. No set can be infinite in most programming languages, and computing most stuff on a lazy infinite set in Haskell is often futile. — Polygnome, Apr 20 '18 at 08:16
@Polygnome The mere fact that it's permitted to contain itself causes it to suffer from the same problems as naive set theory if you try to do that, and for many of the same reasons. For example, the fact that you'd get infinite recursion when trying to compute the hash code of this is not an incidental practical problem as others have suggested but a reflection of the theoretical problems associated with that kind of a construct. — EJoshuaS - Stand with Ukraine, Apr 20 '18 at 13:31
@EJoshuaS You get the same problem when you have a tree that contains itself. Self-referential data structures are always problematic when not dealt with specifically. Again, Sets (especially in Java) are *not* mathematical sets. For example, most of them are finite (`Integer.MAX_VALUE` for `HashSet`). You *can* solve self-referential problems if you specify the behavior - but the docs just say it's unspecified here. — Polygnome, Apr 20 '18 at 13:36
@immibis The [documentation](https://docs.oracle.com/javase/7/docs/api/java/util/Set.html) *explicitly* states that the `Set` interface is trying to model mathematical sets: "As implied by its name, this interface models the mathematical set abstraction." — EJoshuaS - Stand with Ukraine, Apr 20 '18 at 14:58
@ChrisH I disagree - the [documentation](https://docs.oracle.com/javase/7/docs/api/java/util/Set.html) states the following: "As implied by its name, [the `Set`] interface models the mathematical set abstraction." That being said, I think that my argument is reasonable. — EJoshuaS - Stand with Ukraine, Apr 20 '18 at 15:16
@EJoshuaS hm, ok. I'm familiar with sets as a general programming concept and have used them in Java, but never looked at the documentation too closely. I still don't think it's a valid argument for (programming) sets in the general case, but if the Java docs claim it is intended to model mathematical sets then I guess it's a fair point for Java. — Chris H, Apr 20 '18 at 21:03
