4

I want to have a dictionary that assigns a value to a set of integers.

For example, the key is [1 2 3] and it maps to a certain value.

The thing is that [3 2 1] needs to be treated the same in my case, so the hashes need to be equal if I go with a hashing approach.

The set will have 2 to 10 items.

The sum of the items is usually fixed, so we cannot build the hash code from the sum, which is the first natural idea here.

Not a homework task, actually facing this problem in my code.

This set is basically an IEnumerable<int> in C#, so any data structure is fine for storing the items.

Any help appreciated. Performance is pretty important here too.

An immediate thought: we could sum up items^2 and already get a somewhat better hash, but I would still like to hear some thoughts.

EDIT: Hmm, really sorry guys; everyone suggests ordering. It didn't come to my mind that I needed to say that ordering and then hashing is actually the current solution I use, and I am considering faster alternatives.

Otiel
Valentin Kuzub
  • Have you considered using an ordered set as the key instead of IEnumerable? – asawyer Nov 18 '11 at 20:47
  • Ordering is expensive, so yes I have, but it doesn't meet the desired performance; I'd rather not sort 10 items before hashing them. – Valentin Kuzub Nov 18 '11 at 20:48
  • Sorting is not bad for 10 items. – Daniel A. White Nov 18 '11 at 20:58
  • What will be the typical range of your values? – H H Nov 18 '11 at 20:58
  • @DanielA.White Well, it all depends on the definition of performance, I guess. If I could avoid the checks & swaps required to sort 10 items and hash immediately with good distribution, that would obviously be better, right? – Valentin Kuzub Nov 18 '11 at 21:01
  • @HenkHolterman items can be between 1 and 300000 roughly, sum is between 10000 and 10000000 roughly – Valentin Kuzub Nov 18 '11 at 21:02
  • since ideally the key will be immutable, you could calculate it once and store the result. – Daniel A. White Nov 18 '11 at 21:06
  • I had that thought; the key is indeed immutable. However, these lists/sets are generated somewhere else, they aren't flyweights, and storing a calculated hash is going to be useless because it usually won't be needed more than once. – Valentin Kuzub Nov 18 '11 at 21:08
  • @HenkHolterman also items are often distinct (like 95%+ of the time) – Valentin Kuzub Nov 18 '11 at 21:13
  • With a 300k range and small sets (~10) I would stop worrying and simply sum the items. You're not going to completely avoid collisions anyway and the rate won't be bad. – H H Nov 18 '11 at 21:19
  • The sum is usually a constant, like I say in the question: the sum of the items is usually fixed, so adding them up is a guaranteed collision. The sum can be anywhere in that range, but at the time the function operates it will usually be dealing with a big bunch of sets with the same sum. – Valentin Kuzub Nov 18 '11 at 21:22

9 Answers

6

Basically all of the approaches here are instantiations of the same template. Map x1, …, xn to f(x1) op … op f(xn), where op is a commutative associative operation on some set X, and f is a map from items to X. This template has been used a couple of times in ways that are provably good.

  • Choose a random large prime p and a random residue b in [1, p - 1]. Let f(x) = b^x mod p and let op be addition. We essentially interpret a set as a polynomial and use the Schwartz–Zippel lemma to bound the probability of a collision (= the probability that a nonzero polynomial has b as a root mod p).

  • Let op be XOR and let f be a randomly chosen table. This is Zobrist hashing and minimizes in expectation the number of collisions by straightforward linear-algebraic arguments.

Modular exponentiation is slow, so don't use it. As for Zobrist hashing, with 300,000 possible item values, the table f probably won't fit into L2, though it does set an upper bound of one main-memory access.

I would instead take Zobrist hashing as a departure point and look for a cheap function f that behaves like a random function. This is essentially the job description of a non-cryptographic pseudorandom generator – I would try computing f by seeding a fast PRG with x and generating one value.

EDIT: given that the sets all have the same sums, don't choose f to be a degree 1 polynomial (e.g., the step function of a linear congruential generator).
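
For illustration, here is a minimal C# sketch of this template, with XOR as op and, for f, a cheap nonlinear mixer (the 32-bit MurmurHash3 finalizer; that particular choice is mine, not something this answer prescribes):

public static int SetHash(IEnumerable<int> items)
{
    uint h = 0;
    foreach (int x in items)
        h ^= Mix((uint)x);  // op = XOR: commutative, so element order is irrelevant
    return unchecked((int)h);
}

// f: a cheap nonlinear mixer (the 32-bit MurmurHash3 finalizer).
// Nonlinearity matters here, per the EDIT above about degree-1 choices.
private static uint Mix(uint x)
{
    unchecked
    {
        x ^= x >> 16; x *= 0x85ebca6b;
        x ^= x >> 13; x *= 0xc2b2ae35;
        x ^= x >> 16;
    }
    return x;
}

One caveat: with XOR, duplicate elements cancel out, which is fine for sets but not for multisets.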

Per
  • Bloom filters could be thought of as another "hash function" for sets, though this certainly is not their primary use. Here, op = bitwise OR and f(x) is a sparse 0-1 bit array. – Per Nov 18 '11 at 22:31
  • @Henk Holterman I have no idea what the scare quotes are for (provable is provable), but I put in a note about not using a degree-1 polynomial for f. – Per Nov 18 '11 at 23:36
2

Use a HashSet<T> and HashSet<T>.CreateSetComparer(), which returns an IEqualityComparer<HashSet<T>>.
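
A minimal sketch of how that might be wired up:

var dict = new Dictionary<HashSet<int>, string>(HashSet<int>.CreateSetComparer());
dict[new HashSet<int> { 1, 2, 3 }] = "some value";
var v = dict[new HashSet<int> { 3, 2, 1 }];  // same set, so this lookup succeeds

The comparer treats equal sets as equal keys regardless of insertion order; whether it is fast enough for your case is something to measure.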

SLaks
2

I think what is mentioned in this paper would definitely help:

http://people.csail.mit.edu/devadas/pubs/mhashes.pdf

Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking

Abstract: We introduce a new cryptographic tool: multiset hash functions. Unlike standard hash functions which take strings as input, multiset hash functions operate on multisets (or sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are incremental in that, when new members are added to the multiset, the hash can be updated in time proportional to the change. The functions may be multiset-collision resistant in that it is difficult to find two multisets which produce the same hash, or just set-collision resistant in that it is difficult to find a set and a multiset which produce the same hash.
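
To illustrate just the incremental property described in the abstract (this is not the paper's actual construction, which is cryptographic), a hash kept as an XOR of per-element values can be updated in constant time as members come and go:

class IncrementalSetHash
{
    public uint Hash { get; private set; }

    public void Add(int x)    { Hash ^= F((uint)x); }
    public void Remove(int x) { Hash ^= F((uint)x); }  // XOR is its own inverse

    // Any per-element mixer can stand in for F; this one is an arbitrary choice.
    private static uint F(uint x)
    {
        unchecked { x ^= x >> 16; x *= 0x85ebca6b; x ^= x >> 13; x *= 0xc2b2ae35; }
        return x ^ (x >> 16);
    }
}

With XOR, duplicates cancel (set semantics); the paper's additive variants are what handle multisets.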

derekhh
  • From your description, it appears that the main focus is a function that is incremental, in the sense that if the set grows, the hash doesn't need to be fully recalculated. I am not sure that applies to my problem, does it? – Valentin Kuzub Nov 18 '11 at 21:20
  • I would think that crypto-grade hashes are too slow. – Per Nov 18 '11 at 21:21
  • @ValentinKuzub: Another important feature of the hash functions in this paper is that they are defined on sets instead of strings, which makes the values invariant to the ordering of elements in the set, IMHO. – derekhh Nov 18 '11 at 21:31
  • @Per: Yes, that is indeed a problem...btw, are you Per Austrin? – derekhh Nov 18 '11 at 21:34
  • @derekhh No, I'm not Per Austrin. – Per Nov 18 '11 at 21:44
2

I think your squaring idea is going in the right direction, but squaring is a poor choice of function. I'd try something more like the PRNG functions, or just multiplication by a large prime, followed by XOR of all the resulting values.
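
A rough sketch of that suggestion; the particular prime is my choice, not the answer's:

public static int Hash(IEnumerable<int> source)
{
    uint h = 0;
    unchecked
    {
        foreach (int x in source)
            h ^= (uint)x * 2654435761u;  // multiply by a large prime, then XOR it all together
    }
    return unchecked((int)h);
}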

phkahler
1

If the range of the values in the key happens to be limited to low-ish positive integers, you could map each one to a prime number using a simple lookup, then multiply them together to arrive at the value.

Using the example in the question:

[1, 2, 3] maps to 2 x 3 x 5 = 30
[3, 2, 1] maps to 5 x 3 x 2 = 30
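
A sketch under that assumption; the lookup table here only covers values 1 through 10 for illustration (with the question's range of up to ~300,000 it would need to be far larger):

static readonly long[] Primes = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

static long PrimeProductHash(IEnumerable<int> items)
{
    long product = 1;
    foreach (int x in items)
        product *= Primes[x - 1];  // multiplication commutes, so order is irrelevant
    return product;
}
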
0

One possibility: sort the items in the list, then hash that.

Joe
0

You could sort the numbers, then select a sample from predetermined indices, leaving the rest as zero when the current set has fewer numbers. Or you could XOR them, or whatever.
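
A rough sketch of the sampling idea; the particular indices are arbitrary:

static int SampledHash(IEnumerable<int> items)
{
    var sorted = items.OrderBy(x => x).ToArray();  // sort a copy
    int h = 17;
    foreach (int i in new[] { 0, 2, 4, 6, 9 })  // predetermined sample indices
        h = unchecked(h * 31 + (i < sorted.Length ? sorted[i] : 0));  // zero when the set is smaller
    return h;
}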

perreal
0

Why not something like

public int GetOrderIndependentHashCode(IEnumerable<int> source)
{
    // Single pass instead of three, and unchecked so overflow wraps
    // (Enumerable.Sum uses checked arithmetic and would throw here).
    int hash = 0;
    foreach (int x in source)
        unchecked { hash += x * x + x * x * x + x * x * x * x; }
    return hash & 0x7FFFFF;
}
Ivan Bianko
  • Remember that we are competing with the sort approach; with this many multiplications and summations going on, sorting might outmatch this. – Valentin Kuzub Nov 18 '11 at 21:58
-1

Create your own type that implements IEnumerable<T>.

Override GetHashCode. In it, sort your collection, call and return ToArray().GetHashCode().
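
A sketch of that wrapper. One caveat: returning ToArray().GetHashCode() literally would hash the array reference rather than its contents, so this sketch combines the sorted elements' values instead (my adjustment, not part of the answer as written):

class IntSetKey : IEnumerable<int>
{
    private readonly int[] sorted;

    public IntSetKey(IEnumerable<int> items)
    {
        sorted = items.OrderBy(i => i).ToArray();
    }

    public override int GetHashCode()
    {
        int h = 17;
        foreach (int x in sorted)
            h = unchecked(h * 31 + x);  // order is fixed by the sort, so [1 2 3] == [3 2 1]
        return h;
    }

    public override bool Equals(object obj)
    {
        var other = obj as IntSetKey;
        return other != null && sorted.SequenceEqual(other.sorted);
    }

    public IEnumerator<int> GetEnumerator()
    {
        return ((IEnumerable<int>)sorted).GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}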

Daniel A. White