Algorithm for hash/crc of unordered multiset

Question

Let's say I would like to create an unordered set of unordered multisets of unsigned int. For this, I need a hash function that computes a hash of an unordered multiset. In fact, it has to be good enough to use as a CRC as well.

One obvious solution is to put the items in a vector, sort them, and return a hash of the result. This seems to work, but it is expensive.
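
For reference, the sort-then-hash baseline might look like the sketch below. The function name `sorted_hash` and the FNV-1a-style mixing step are assumptions chosen for illustration; any order-sensitive hash over the sorted sequence would do.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Baseline: copy the elements to a vector, sort, then hash the sorted sequence.
// The mixing step is FNV-1a-style applied per element (an illustrative choice).
std::uint64_t sorted_hash(const std::unordered_multiset<unsigned>& ms) {
    std::vector<unsigned> v(ms.begin(), ms.end());
    std::sort(v.begin(), v.end());                 // O(n log n): the expensive part
    std::uint64_t h = 14695981039346656037ULL;     // FNV-1a 64-bit offset basis
    for (unsigned x : v) {
        h ^= x;
        h *= 1099511628211ULL;                     // FNV-1a 64-bit prime
    }
    return h;
}
```

The copy and the sort are paid on every call, which is the cost the question wants to avoid.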

Another approach is to xor the values, but an item that appears twice then contributes exactly the same as an item that does not appear at all, which is not good.
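
A minimal sketch of that failure, using a hypothetical helper `xor_hash`:

```cpp
#include <cstdint>
#include <vector>

// XOR of all elements is order-independent, but x ^ x == 0,
// so every pair of equal elements cancels out of the hash.
std::uint32_t xor_hash(const std::vector<std::uint32_t>& items) {
    std::uint32_t h = 0;
    for (std::uint32_t x : items) {
        h ^= x;
    }
    return h;
}
```

With this function, `{7, 7}` hashes the same as the empty multiset, and `{5}` hashes the same as `{5, 5, 5}`.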

Any ideas how I can implement this more cheaply? I have an application that will be doing this for thousands of sets, and relatively big ones.

Can you modify the multisets so that they recompute their hashes on insertion / removal? Then if you need to do lookups multiple times you don't have to keep recomputing the hashes. — Alex Reinking, Apr 09 '16 at 17:56
Because the cached value could simply be *read*, you wouldn't need to compute it thousands of times. — Alex Reinking, Apr 10 '16 at 01:54
I have thousands of sets; I do not do this for the same set again and again. — gsf, Apr 10 '16 at 02:02

score 2 · Answer 1 · answered Apr 10 '16 at 03:46

Since it is a multiset, you want the hash value to be the same for identical multisets, even if their elements were presented, added, or deleted in a different order. The hash should therefore be commutative, easy to update, and change whenever the elements change. You would also like two changes not to readily cancel each other's effect on the hash.

One operation that meets all but the last criterion is addition. Just sum the elements. To keep the sum bounded, do the sum modulo the range of your hash value (e.g. modulo 2⁶⁴ for a 64-bit hash). To make sure that inserting or deleting zero values changes the hash, add one to each value first.
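
A minimal sketch of this additive hash, assuming a 64-bit hash and 32-bit elements. The struct name `SumHash` is hypothetical.

```cpp
#include <cstdint>

// Additive multiset hash: the sum of (x + 1) modulo 2^64.
// Unsigned arithmetic in C++ wraps on overflow, which supplies the modulo.
struct SumHash {
    std::uint64_t value = 0;

    // Adding one first means that inserting the element 0 still changes the hash.
    void insert(std::uint32_t x) { value += std::uint64_t{x} + 1; }

    // Removal is the inverse operation, so updates cost O(1) either way.
    void erase(std::uint32_t x) { value -= std::uint64_t{x} + 1; }
};
```

Because addition is commutative, the order of insertions and removals does not affect the final `value`.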

A drawback of the sum is that two changes can readily cancel, e.g. replacing the elements 1 and 3 with 2 and 2 leaves the sum unchanged. To address that, you can use the same approach and sum a polynomial of the entries, still retaining commutativity. E.g. instead of summing x+1, you can sum x²+x+1. Now it is more difficult to contrive sets of changes with the same sum.
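
The difference can be checked on the example from this answer. The function names `plain_sum` and `poly_sum` below are hypothetical, and the arithmetic is taken modulo 2⁶⁴ through unsigned wraparound.

```cpp
#include <cstdint>
#include <vector>

// Sum of (x + 1): the multisets {1, 3} and {2, 2} both give 6.
std::uint64_t plain_sum(const std::vector<std::uint64_t>& items) {
    std::uint64_t h = 0;
    for (std::uint64_t x : items) {
        h += x + 1;
    }
    return h;
}

// Sum of (x^2 + x + 1): {1, 3} gives 3 + 13 = 16, while {2, 2} gives 7 + 7 = 14.
std::uint64_t poly_sum(const std::vector<std::uint64_t>& items) {
    std::uint64_t h = 0;
    for (std::uint64_t x : items) {
        h += x * x + x + 1;   // wraps modulo 2^64 for large x
    }
    return h;
}
```

Both functions remain order-independent, and each can still be updated incrementally by adding or subtracting the term for a single element.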

Is this correct, though? For example, for 16 bits, if I start with 0xFFFF and add another 0xFFFF, 0xFFFF + 0xFFFF = 0x7FFF; then if I remove it, 0x7FFF - 0xFFFF = 0x7FFF - the initial and the end value are not the same. — gsf, Apr 10 '16 at 04:17
Modulo 2^16: 0xFFFF + 0xFFFF = 0xFFFE, and 0x7FFF - 0xFFFF = 0x8000. And of course, 0xFFFE - 0xFFFF = 0xFFFF. — Mark Adler, Apr 10 '16 at 14:21
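
The arithmetic in the reply above can be reproduced with 16-bit unsigned integers. The helper names `add16` and `sub16` are hypothetical.

```cpp
#include <cstdint>

// Addition modulo 2^16. The operands are promoted to int, and the cast back
// to uint16_t reduces the result modulo 2^16.
std::uint16_t add16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}

// Subtraction modulo 2^16. A negative intermediate result wraps around
// when converted to the unsigned 16-bit type.
std::uint16_t sub16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a - b);
}
```

Here `add16(0xFFFF, 0xFFFF)` yields 0xFFFE, and `sub16(0xFFFE, 0xFFFF)` yields 0xFFFF again, so adding and then removing an element restores the starting value.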

Alex Reinking · Answer 2 · 2016-04-10T19:51:48.173

Here's a reasonable hash function for std::unordered_multiset<int>. It would be better if the computations were taken mod a large prime, but the idea stands.

#include <iostream>
#include <unordered_set>

namespace std {
    template<>
    struct hash<unordered_multiset<int>> {
        typedef unordered_multiset<int> argument_type;
        typedef std::size_t result_type;

        const result_type BASE = static_cast<result_type>(0xA67);

        // Computes BASE^ex by square-and-multiply; size_t arithmetic wraps on overflow.
        result_type log_pow(result_type ex) const {
            result_type res = 1;
            result_type base = BASE;
            while (ex > 0) {
                if (ex % 2) {
                    res = res * base;
                }
                base *= base;
                ex /= 2;
            }
            return res;
        }

        result_type operator()(argument_type const & val) const {
            result_type h = 0;
            for (const int& el : val) {
                h += log_pow(el);
            }
            return h;
        }
    };
} // namespace std

int main() {
    std::unordered_set<std::unordered_multiset<int>> mySet;
    std::unordered_multiset<int> set1{1,2,3,4};
    std::unordered_multiset<int> set2{1,1,2,2,3,3,4,4};
    std::cout << "Hash 1: " << std::hash<std::unordered_multiset<int>>()(set1) 
              << std::endl;
    std::cout << "Hash 2: " << std::hash<std::unordered_multiset<int>>()(set2) 
              << std::endl;
    return 0;
}

Output:

Hash 1: 2290886192
Hash 2: 286805088

When the modulus is a prime p, the number of collisions is proportional to 1/p. I'm not sure what the analysis is for powers of two, which is what the wrapping size_t arithmetic above uses. You can make updates to the hash efficient by adding BASE^x when you insert the integer x and subtracting it when you remove x.
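
A sketch of such an incrementally maintained hash, with the arithmetic taken modulo a prime. The class name `IncrementalHash` and the modulus 1000000007 are assumptions for illustration; BASE is the value used in the code above, and elements are assumed to be unsigned.

```cpp
#include <cstdint>

// Multiset hash kept up to date on every change: h = sum of BASE^x mod P.
class IncrementalHash {
public:
    static constexpr std::uint64_t P = 1000000007ULL;  // assumed prime modulus
    static constexpr std::uint64_t BASE = 0xA67;       // base from the answer

    void insert(std::uint32_t x) { h_ = (h_ + pow_mod(x)) % P; }

    // Adding P before subtracting keeps the intermediate value non-negative.
    void erase(std::uint32_t x) { h_ = (h_ + P - pow_mod(x)) % P; }

    std::uint64_t value() const { return h_; }

private:
    // Square-and-multiply: BASE^ex mod P in O(log ex) multiplications.
    // Every operand stays below P < 2^30, so products fit in 64 bits.
    static std::uint64_t pow_mod(std::uint64_t ex) {
        std::uint64_t res = 1;
        std::uint64_t base = BASE % P;
        while (ex > 0) {
            if (ex & 1) {
                res = res * base % P;
            }
            base = base * base % P;
            ex >>= 1;
        }
        return res;
    }

    std::uint64_t h_ = 0;
};
```

Reading `value()` is then O(1), and each insertion or removal costs one modular exponentiation.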

Stefan Haustein · Answer 3 · 2016-04-10T20:44:51.017

Implement the inner multiset as a value->count hash map.

This will allow you to avoid the problem that an even number of elements cancels out via xor in the following way: Instead of xor-ing each element, you construct a new number from the count and the value (e.g. multiplying them), and then you can build the full hash using xor.
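
A minimal sketch of this idea, assuming the multiset is stored as a value-to-count map. The alias `CountedMultiset`, the function `counted_xor_hash`, and the choice of (value + 1) * count as the combined number are illustrative assumptions; the answer only suggests multiplying as one option.

```cpp
#include <cstdint>
#include <unordered_map>

// The multiset is stored as value -> number of occurrences.
using CountedMultiset = std::unordered_map<std::uint32_t, std::uint64_t>;

// Each (value, count) pair is turned into one number, and the per-pair
// numbers are combined with XOR, so iteration order does not matter.
std::uint64_t counted_xor_hash(const CountedMultiset& ms) {
    std::uint64_t h = 0;
    for (const auto& entry : ms) {
        // value + 1 keeps the element 0 from always contributing 0.
        h ^= (std::uint64_t{entry.first} + 1) * entry.second;
    }
    return h;
}
```

Since each value occurs in the map only once, a repeated element no longer cancels itself. A plain product is a weak mixer, though: with this choice {1, 1, 1} and {2, 2} both hash to 6, so a stronger combining function may be worth considering.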

edited Apr 10 '16 at 20:44

answered Apr 09 '16 at 22:11
