
According to the accepted answer to the question "How does Set ensure equatability in Swift?", hashValue is used as the first test for uniqueness. If the hashValue matches another element's hashValue, then == is used as the backup test.

But still, behind the scenes, Set has to store a unique identifier for each element. Consider this example:

struct Country {
    let name: String
    let capital: String
}

extension Country: Hashable {
    static func == (lhs: Country, rhs: Country) -> Bool {
        return lhs.name == rhs.name && lhs.capital == rhs.capital
    }

    var hashValue: Int {
        return name.hashValue ^ capital.hashValue
    }
}

let singapore = Country(name: "Singapore", capital: "Singapore")
let monaco = Country(name: "Monaco", capital: "Monaco")
singapore.hashValue // returns 0, because x ^ x == 0
monaco.hashValue // returns 0, for the same reason


var countries: Set<Country> = []
countries.insert(singapore)
countries.insert(monaco)

countries // Contains both singapore and monaco

As you can see, some countries have the same name as their capital, and this causes a hashValue collision. The set then has to run the more expensive == to determine uniqueness, which might not be O(1). But after doing this comparison, Set has to generate a unique identifier for the element to store behind the scenes.

The question: how does Set generate a unique identifier for colliding elements like these?

Edward Anthony
  • What happens when you print the hashValues of the individual String members? – Carlos Aug 11 '17 at 08:17
  • Source code available for viewing: https://github.com/apple/swift-corelibs-foundation/tree/19249417b01573bd6aa32b9a24cc42273315a48b/Foundation – Scroog1 Aug 11 '17 at 08:22
  • This situation is called a "hash collision". In that situation the set, instead of having just one object for that hashValue, would create another collection, like an inner Set, and store the objects there. – Kostiantyn Koval Aug 11 '17 at 08:24
  • @Carlos those are normal, that's not the problem. I used XOR to generate the `hashValue`. If you XOR two identical values, the result is always 0. I want to know how Swift generates a unique identifier for colliding elements like these. – Edward Anthony Aug 11 '17 at 08:32
  • Why would it need to generate a unique identifier? Hash collision is a known problem and if you are designing your own class that conforms to the `Hashable` protocol, you are responsible for making the probability of a hash collision as small as possible. As you have already stated in your question, `Swift` has a built in mechanism for checking uniqueness even if there is a hash collision, so there is no need to generate another "unique" identifier... – Dávid Pásztor Aug 11 '17 at 08:59
  • @DávidPásztor that mechanism is my question. I'm sure that behind the scenes Set will generate a unique identifier to replace the collided `hashValue`. There are a couple of reasons why Swift needs this UID. First, for fast access: this could improve performance when accessing the element. Second, to prevent another collision, so an element that has already collided won't collide again, because `==` is expensive. – Edward Anthony Aug 11 '17 at 09:07
  • Did you even read my comment? As I have already stated, there is no need for another unique identifier, this is what the hash value is for. The built in types conforming to `Hashable` provide a hash value that is quite unlikely to collide, so there is a really small chance that the equality operator will ever have to be used on the elements. And if you are designing your own class conforming to `Hashable`, `Swift` cannot and will not guarantee the O(1) access time. If you cannot write a good enough hash algorithm, it's not the compiler's problem... – Dávid Pásztor Aug 11 '17 at 09:13
  • @DávidPásztor Data is stored as key-value pairs. In the case of `Dictionary` the key is the `Hashable` object, in the case of `Array` the key is an integer, and in the case of `Set` the key is the `hashValue` of the element. The reason we can iterate an array or dictionary is that each of its elements has a key, so when we iterate it, the collection knows what it has to return for the next iteration (the next method in IteratorProtocol). The same applies to `Set`: without a key, we can't iterate it. CMIIW. So Set has to specify an identifier so that it can access the element during iteration. – Edward Anthony Aug 11 '17 at 09:40
  • _I'm sure behind the scene Set will generate unique identifier to replace collided hashValue_ I do not understand why you can be sure even if I read your two reasons. You need to show some steady verifiable proof, not your feeling nor intuition. As far as I tested, hash collision just increases the number of calls to `==`. – OOPer Aug 11 '17 at 09:40
  • Seems you need to know one more thing. Hash-based Set is very similar to a Dictionary of type `[Element: Void]`, each element itself is Key. – OOPer Aug 11 '17 at 09:44
  • @EdwardAnthony yes, and that key is the hash-value or the value of the element itself in case of a hash collision. But if you are so sure you are right, why don't you just have a look at the implementation yourself? Swift is open source, so you can check it out. – Dávid Pásztor Aug 11 '17 at 09:53
  • @DávidPásztor Sure, been digging into the open source code since yesterday. Maybe I can get some explanation from people who've jumped into the Swift's stdlib before :D. Thanks for the help anyway :) – Edward Anthony Aug 11 '17 at 09:59
  • @EdwardAnthony I think your question would be better phrased as simply "how does `Set` resolve hash collisions?" (if this is indeed your question). For native storage, [linear probing](https://en.wikipedia.org/wiki/Linear_probing) is used. – Hamish Aug 11 '17 at 10:38
  • @Hamish You're right. Thank you. I updated the title. – Edward Anthony Aug 11 '17 at 11:06
  • @Hamish As I expected, it does generate a key. You're right, it turns out to be linear probing. https://github.com/apple/swift/blob/master/stdlib/public/core/HashedCollections.swift.gyb Thanks for the help. Please write an answer, I'll mark it as accepted. – Edward Anthony Aug 11 '17 at 11:15
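
The observation in the comments that a hash collision only increases the number of calls to `==` can be checked with a small experiment. This is a minimal sketch, not taken from the thread: the `Probe` type and its `equalityCalls` counter are hypothetical names introduced here, and `hash(into:)` is deliberately written so that every value gets the same hash.

```swift
// Hypothetical type for illustration: every value hashes identically,
// and each call to == is counted.
struct Probe: Hashable {
    static var equalityCalls = 0
    let id: Int

    static func == (lhs: Probe, rhs: Probe) -> Bool {
        equalityCalls += 1
        return lhs.id == rhs.id
    }

    // Feed the same constant for every value, so all elements collide.
    func hash(into hasher: inout Hasher) {
        hasher.combine(0)
    }
}

var probes: Set<Probe> = []
for id in 0..<10 {
    probes.insert(Probe(id: id))
}

// All ten distinct elements are stored despite sharing one hash value.
print(probes.count)
// The counter shows how often Set fell back to == while inserting.
print(Probe.equalityCalls)
```

The exact number of `==` calls depends on the standard library's internals, so the sketch prints it rather than assuming a value.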

1 Answer


It seems that the hash value is used only to identify the bucket in which to insert the element internally (the hash is not stored), while == is used to check whether the element is already present. All the elements also need to be rehashed when the collection's storage grows.

You can get more information in a discussion here.
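
The behavior described above (hash picks the bucket, == confirms identity, growth forces a rehash) together with the linear probing mentioned in the comments can be sketched as a toy set. This is not the standard library's implementation: `ProbingSet`, its 3/4 load factor, and the `Collider` test type are all assumptions made for illustration.

```swift
// A toy open-addressing hash set using linear probing.
// Illustrative only: names and growth policy are hypothetical, not the stdlib's.
struct ProbingSet<Element: Hashable> {
    private var buckets: [Element?]
    private(set) var count = 0

    init(capacity: Int = 8) {
        buckets = Array(repeating: nil, count: max(capacity, 2))
    }

    // Map the hash value to a starting bucket; the hash itself is not stored.
    private func idealIndex(for element: Element) -> Int {
        // Reinterpret as unsigned so negative hash values give a valid index.
        return Int(UInt(bitPattern: element.hashValue) % UInt(buckets.count))
    }

    // Walk forward from the ideal bucket until the element or an empty slot is found.
    private func find(_ element: Element) -> (index: Int, found: Bool) {
        var index = idealIndex(for: element)
        while let existing = buckets[index] {
            if existing == element { return (index, true) }
            index = (index + 1) % buckets.count
        }
        return (index, false)
    }

    func contains(_ element: Element) -> Bool {
        return find(element).found
    }

    @discardableResult
    mutating func insert(_ element: Element) -> Bool {
        // Keep the load factor at or below 3/4 so probing always reaches an empty slot.
        if (count + 1) * 4 > buckets.count * 3 { grow() }
        let (index, found) = find(element)
        if found { return false }
        buckets[index] = element
        count += 1
        return true
    }

    // Bucket indices depend on the bucket count, so growth re-inserts every element.
    private mutating func grow() {
        let old = buckets
        buckets = Array(repeating: nil, count: old.count * 2)
        count = 0
        for case let element? in old { insert(element) }
    }
}

// Hypothetical element type whose values all share one hash.
struct Collider: Hashable {
    let id: Int
    static func == (lhs: Collider, rhs: Collider) -> Bool { lhs.id == rhs.id }
    func hash(into hasher: inout Hasher) { hasher.combine(0) }
}

var toy = ProbingSet<Collider>()
toy.insert(Collider(id: 1))
toy.insert(Collider(id: 2))   // collides, lands in the next free bucket
toy.insert(Collider(id: 1))   // duplicate, rejected by ==
```

In this sketch a colliding element gets no new identifier: its position is simply the next free bucket after the one its hash points to, and lookups repeat the same walk.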

Dario