4

I want to have a dictionary that assigns a value to a set of integers.

For example, the key is [1 2 3] and it maps to a certain value.

The thing is that [3 2 1] needs to be treated the same in my case, so the hashes need to be equal if I go with a hashing approach.

The set will have 2 to 10 items.

The sum of the items is usually fixed, so we cannot build the hash code from the sum, which is the first natural idea here.

Not a homework task, actually facing this problem in my code.

This set is basically an IEnumerable<int> in C#, so any data structure is fine for storing the items.

Any help appreciated. Performance is pretty important here too.

An immediate thought: we could sum up items^2 and already get a somewhat better hash, but I would still like to hear some thoughts.

EDIT: Hmm, really sorry guys; everyone suggests ordering. It didn't come to my mind that I needed to say that ordering and then hashing is actually the current solution I use, and I am considering faster alternatives.

Otiel
Valentin Kuzub
  • Have you considered using an ordered set as the key instead of IEnumerable? – asawyer Nov 18 '11 at 20:47
  • Ordering is expensive, so yes I have, but it doesn't meet the desired performance; I'd rather not sort 10 items before hashing them. – Valentin Kuzub Nov 18 '11 at 20:48
  • Sorting is not bad for 10 items. – Daniel A. White Nov 18 '11 at 20:58
  • What will be the typical range of your values? – H H Nov 18 '11 at 20:58
  • @DanielA.White Well, it all depends on the definition of performance, I guess. If I could avoid the checks & swaps required to sort 10 items and hash immediately with good distribution, that would obviously be better, right? – Valentin Kuzub Nov 18 '11 at 21:01
  • @HenkHolterman items can be between 1 and 300000 roughly, sum is between 10000 and 10000000 roughly – Valentin Kuzub Nov 18 '11 at 21:02
  • since ideally the key will be immutable, you could calculate it once and store the result. – Daniel A. White Nov 18 '11 at 21:06
  • I had that thought; the key is indeed immutable. However, these lists/sets are generated somewhere else, they aren't flyweights, and storing a calculated hash is going to be useless because it usually won't be needed more than once. – Valentin Kuzub Nov 18 '11 at 21:08
  • @HenkHolterman also items are often distinct (like 95%+ of the time) – Valentin Kuzub Nov 18 '11 at 21:13
  • With a 300k range and small sets (~10) I would stop worrying and simply sum the items. You're not going to completely avoid collisions anyway and the rate won't be bad. – H H Nov 18 '11 at 21:19
  • The sum is usually a constant, like I say in the question: the sum of the items is usually fixed, so adding them up is a guaranteed collision. The sum can be anywhere in that range, but at the time the function operates it will usually be dealing with a big bunch of sets with the same sum. – Valentin Kuzub Nov 18 '11 at 21:22

9 Answers

6

Basically all of the approaches here are instantiations of the same template. Map x1, …, xn to f(x1) op … op f(xn), where op is a commutative associative operation on some set X, and f is a map from items to X. This template has been used a couple of times in ways that are provably good.

  • Choose a random large prime p and a random residue b in [1, p - 1]. Let f(x) = b^x mod p and let op be addition. We essentially interpret a set as a polynomial and use the Schwartz–Zippel lemma to bound the probability of a collision (= the probability that a nonzero polynomial has b as a root mod p).

  • Let op be XOR and let f be a randomly chosen table. This is Zobrist hashing and minimizes in expectation the number of collisions by straightforward linear-algebraic arguments.

Modular exponentiation is slow, so don't use it. As for Zobrist hashing, with 300,000 possible item values, the table f probably won't fit into L2, though it does set an upper bound of one main-memory access.

I would instead take Zobrist hashing as a departure point and look for a cheap function f that behaves like a random function. This is essentially the job description of a non-cryptographic pseudorandom generator – I would try computing f by seeding a fast PRG with x and generating one value.

EDIT: given that the sets all have the same sums, don't choose f to be a degree 1 polynomial (e.g., the step function of a linear congruential generator).
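
For illustration, here is a minimal C# sketch of this template, with XOR as op and, for f, a cheap nonlinear mixer (the 32-bit MurmurHash3 finalizer; that particular choice is mine, not something this answer prescribes):

public static int SetHash(IEnumerable<int> items)
{
    uint h = 0;
    foreach (int x in items)
        h ^= Mix((uint)x);  // op = XOR: commutative, so element order is irrelevant
    return unchecked((int)h);
}

// f: a cheap nonlinear mixer (the 32-bit MurmurHash3 finalizer).
// Nonlinearity matters here, per the EDIT above about degree-1 choices.
private static uint Mix(uint x)
{
    unchecked
    {
        x ^= x >> 16; x *= 0x85ebca6b;
        x ^= x >> 13; x *= 0xc2b2ae35;
        x ^= x >> 16;
    }
    return x;
}

One caveat: with XOR, duplicate elements cancel out, which is fine for sets but not for multisets.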

Per
  • Bloom filters could be thought of as another "hash function" for sets, though this certainly is not their primary use. Here, op = bitwise OR and f(x) is a sparse 0-1 bit array. – Per Nov 18 '11 at 22:31
  • @Henk Holterman I have no idea what the scare quotes are for (provable is provable), but I put in a note about not using a degree-1 polynomial for f. – Per Nov 18 '11 at 23:36
2

Use a HashSet<T> and HashSet<T>.CreateSetComparer(), which returns an IEqualityComparer<HashSet<T>>.
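
A minimal sketch of how that might be wired up:

var dict = new Dictionary<HashSet<int>, string>(HashSet<int>.CreateSetComparer());
dict[new HashSet<int> { 1, 2, 3 }] = "some value";
var v = dict[new HashSet<int> { 3, 2, 1 }];  // same set, so this lookup succeeds

The comparer treats equal sets as equal keys regardless of insertion order; whether it is fast enough for your case is something to measure.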

SLaks
2

I think what is mentioned in this paper would definitely help:

http://people.csail.mit.edu/devadas/pubs/mhashes.pdf

Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking

Abstract: We introduce a new cryptographic tool: multiset hash functions. Unlike standard hash functions which take strings as input, multiset hash functions operate on multisets (or sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are incremental in that, when new members are added to the multiset, the hash can be updated in time proportional to the change. The functions may be multiset-collision resistant in that it is difficult to find two multisets which produce the same hash, or just set-collision resistant in that it is difficult to find a set and a multiset which produce the same hash.
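
To illustrate just the incremental property described in the abstract (this is not the paper's actual construction, which is cryptographic), a hash kept as an XOR of per-element values can be updated in constant time as members come and go:

class IncrementalSetHash
{
    public uint Hash { get; private set; }

    public void Add(int x)    { Hash ^= F((uint)x); }
    public void Remove(int x) { Hash ^= F((uint)x); }  // XOR is its own inverse

    // Any per-element mixer can stand in for F; this one is an arbitrary choice.
    private static uint F(uint x)
    {
        unchecked { x ^= x >> 16; x *= 0x85ebca6b; x ^= x >> 13; x *= 0xc2b2ae35; }
        return x ^ (x >> 16);
    }
}

With XOR, duplicates cancel (set semantics); the paper's additive variants are what handle multisets.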

derekhh
  • From your description, it appears that the main focus is a function that is incremental, in the sense that if the set grows, the hash doesn't need to be fully recalculated. I am not sure that applies to my problem, does it? – Valentin Kuzub Nov 18 '11 at 21:20
  • I would think that crypto-grade hashes are too slow. – Per Nov 18 '11 at 21:21
  • @ValentinKuzub: Another important feature of the hash functions in this paper is that they are defined on sets instead of strings, which makes the values invariant to the ordering of elements in the set, IMHO. – derekhh Nov 18 '11 at 21:31
  • @Per: Yes, that is indeed a problem...btw, are you Per Austrin? – derekhh Nov 18 '11 at 21:34
  • @derekhh No, I'm not Per Austrin. – Per Nov 18 '11 at 21:44
2

I think your squaring idea is going in the right direction, but squaring is a poor choice of function. I'd try something more like the PRNG functions, or just multiplication by a large prime, followed by XOR of all the resulting values.
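
A rough sketch of that suggestion; the particular prime is my choice, not the answer's:

public static int Hash(IEnumerable<int> source)
{
    uint h = 0;
    unchecked
    {
        foreach (int x in source)
            h ^= (uint)x * 2654435761u;  // multiply by a large prime, then XOR it all together
    }
    return unchecked((int)h);
}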

phkahler
1

If the range of the values in the key happens to be limited to low-ish positive integers, you could map each one to a prime number using a simple lookup, then multiply them together to arrive at the value.

Using the example in the question:

[1, 2, 3] maps to 2 x 3 x 5 = 30
[3, 2, 1] maps to 5 x 3 x 2 = 30
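
A sketch under that assumption; the lookup table here only covers values 1 through 10 for illustration (with the question's range of up to ~300,000 it would need to be far larger):

static readonly long[] Primes = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

static long PrimeProductHash(IEnumerable<int> items)
{
    long product = 1;
    foreach (int x in items)
        product *= Primes[x - 1];  // multiplication commutes, so order is irrelevant
    return product;
}
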
0

One possibility: sort the items in the list, then hash that.

Joe
0

You could sort the numbers, then select a sample from predetermined indices, leaving the rest as zero when the current set has fewer numbers. Or you could XOR them, or whatever.
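
A rough sketch of the sampling idea; the particular indices are arbitrary:

static int SampledHash(IEnumerable<int> items)
{
    var sorted = items.OrderBy(x => x).ToArray();  // sort a copy
    int h = 17;
    foreach (int i in new[] { 0, 2, 4, 6, 9 })  // predetermined sample indices
        h = unchecked(h * 31 + (i < sorted.Length ? sorted[i] : 0));  // zero when the set is smaller
    return h;
}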

perreal
0

Why not something like

public int GetOrderIndependentHashCode(IEnumerable<int> source)
{
    // Single pass instead of three, and unchecked so overflow wraps
    // (Enumerable.Sum uses checked arithmetic and would throw here).
    int hash = 0;
    foreach (int x in source)
        unchecked { hash += x * x + x * x * x + x * x * x * x; }
    return hash & 0x7FFFFF;
}
Ivan Bianko
  • Remember that we are competing with the sort approach; with this many multiplications and summations going on, sorting might outmatch this. – Valentin Kuzub Nov 18 '11 at 21:58
-1

Create your own type that implements IEnumerable<T>.

Override GetHashCode. In it, sort your collection, call and return ToArray().GetHashCode().
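
A sketch of that wrapper. One caveat: returning ToArray().GetHashCode() literally would hash the array reference rather than its contents, so this sketch combines the sorted elements' values instead (my adjustment, not part of the answer as written):

class IntSetKey : IEnumerable<int>
{
    private readonly int[] sorted;

    public IntSetKey(IEnumerable<int> items)
    {
        sorted = items.OrderBy(i => i).ToArray();
    }

    public override int GetHashCode()
    {
        int h = 17;
        foreach (int x in sorted)
            h = unchecked(h * 31 + x);  // order is fixed by the sort, so [1 2 3] == [3 2 1]
        return h;
    }

    public override bool Equals(object obj)
    {
        var other = obj as IntSetKey;
        return other != null && sorted.SequenceEqual(other.sorted);
    }

    public IEnumerator<int> GetEnumerator()
    {
        return ((IEnumerable<int>)sorted).GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}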

Daniel A. White