Let's say I have:

n = 14

n is the result of the following sums of integers:

[5, 2, 7] -> 5 + 2 + 7 = 14 = n
[3, 4, 5, 2] -> 3 + 4 + 5 + 2 = 14 = n
[1, 13] -> 1 + 13 = 14 = n
[13, 1] -> 13 + 1 = 14 = n
[4, 3, 5, 2] -> 4 + 3 + 5 + 2 = 14 = n
...

I would need a hash function h so that:

h([5, 2, 7]) = h([3, 4, 5, 2]) = h([1, 13]) = h([13, 1]) = h([4, 3, 5, 2]) = h(...)

I.e. the order of the integer terms doesn't matter, and as long as their integer sum is the same, their hash should also be the same.

I need to do this without computing the sum n, because the terms as well as n can be very large and easily overflow (they don't fit in the bits of an int). That's why I am asking this question.

Do you know how I can implement such a hash function? Given a list/sequence of integers, this hash function must return the same hash whenever the sum of the integers would be the same, but without computing the sum.

Thank you for your attention.

EDIT: I elaborated on @derpirscher's answer and modified his function a bit further as I had collisions on multiples of BIG_PRIME (this example is in JavaScript):

function hash(seq) {
  const BIG_PRIME = 999999999989;
  const MAX_SAFE_INTEGER_DIV_2_FLOOR = Math.floor(Number.MAX_SAFE_INTEGER / 2);
  let h = 0;
  for (let i = 0; i < seq.length; i++) {
    let value = seq[i];
    if (h > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
      h = h % BIG_PRIME;
    }
    if (value > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
      value = value % BIG_PRIME;
    }
    h += value;
  }
  return h;
}

My question now would be: what do you think about this function? Are there some edge cases I didn't take into account?

Thank you.

EDIT 2:

Using the above function, hash([1,2]) and hash([4504 * BIG_PRIME + 1, 4504 * BIG_PRIME + 2]) will collide, as mentioned by @derpirscher.

Here is another modified version of the above function, which applies % BIG_PRIME to only one of the two terms if either of the two is greater than MAX_SAFE_INTEGER_DIV_2_FLOOR:

function hash(seq) {
  const BIG_PRIME = 999999999989;
  const MAX_SAFE_INTEGER_DIV_2_FLOOR = Math.floor(Number.MAX_SAFE_INTEGER / 2);
  let h = 0;
  for (let i = 0; i < seq.length; i++) {
    let value = seq[i];
    if (
      h > MAX_SAFE_INTEGER_DIV_2_FLOOR &&
      value > MAX_SAFE_INTEGER_DIV_2_FLOOR
    ) {
      if (h > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
        h = h % BIG_PRIME;
      } else if (value > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
        value = value % BIG_PRIME;
      }
    }
    h += value;
  }
  return h;
}

I think this version lowers the number of collisions a bit further.

What do you think? Thank you.

EDIT 3:

Even though I tried to elaborate on @derpirscher's answer, his implementation of hash is the correct one and the one to use.

Use his version if you need such a hash function.

tonix
  • I'd say that computing the sum is the cheapest way to calculate a hash value. And if you get overflows when doing additions, you'll get overflows with most other operations as well. The logical operations that won't overflow probably also won't satisfy the requirement. BTW, always returning 37 as the hash value, regardless of the input, would satisfy your requirement. – Ronald Oct 28 '21 at 07:16
  • Actually, there is no difference at all to my approach. You are just using a bigger modulus, because the range of JavaScript's `number` is bigger than C#'s `int` (which I took as the basis for my answer). You will get the inevitable collisions (but at bigger values) once the elements (or the sum) in your sequence get bigger than `MAX_SAFE_INTEGER_DIV_2_FLOOR`. Try `hash([1,2]);` and `hash([4504*BIG_PRIME +1, 4504*BIG_PRIME +2])`: they will both return `3`. – derpirscher Oct 29 '21 at 06:46
  • Furthermore, I'd suggest using a value for `BIG_PRIME` that is as near as possible to `MAX_SAFE_INTEGER_DIV_2_FLOOR`, because that will reduce the number of collisions dramatically. – derpirscher Oct 29 '21 at 06:55
  • And there is literally no difference between `x = x % N` and `if (x > N) x = x % N`, because any reasonable implementation of `%` will do this conditional check anyway and, if `x < N`, just return `x` without further calculations. – derpirscher Oct 29 '21 at 07:00
  • Thank you for the clarification, @derpirscher. In my case `N = BIG_PRIME = 999999999989`, but the test checks against a different number, `Y = MAX_SAFE_INTEGER_DIV_2_FLOOR`, so it's `if (x > Y) x = x % N` in my case. – tonix Oct 29 '21 at 08:32
  • `Try hash([1,2]); and hash([4504*BIG_PRIME +1, 4504*BIG_PRIME +2]) they both will return 3.` Yeah, they collide. – tonix Oct 29 '21 at 08:33
  • What if I compute the modulo `value % BIG_PRIME` only to one of the two terms of the sum `h + value`? This way `hash([1,2]);` and `hash([4504*BIG_PRIME +1, 4504*BIG_PRIME +2])` don't collide. Please, check my EDIT 2, thank you! – tonix Oct 29 '21 at 08:49
  • It doesn't matter. Once you use a modulo operation, you get collisions. Fact. Maybe for different inputs, but they are inevitable. By the way, with your latest approach you may produce overflows. Assume `h == 20` and `value == Number.MAX_SAFE_INTEGER - 1`. Then you are computing `h = 20 + value`, which results in `h` being `MAX_SAFE_INTEGER + 19` ... – derpirscher Oct 29 '21 at 09:31
  • BTW, your `else if` branch will never be executed, because once you enter the body of the outer `if`, the condition of the inner `if` is always `true` ... – derpirscher Oct 29 '21 at 09:36
  • You are right, I am going back to the implementation of `hash` in your answer, as it is the correct one. Thank you again! – tonix Oct 29 '21 at 09:42

1 Answer

You could calculate the sum modulo some big prime. If you want to stay within the range of int, you need to know the maximum integer in the language you are using, then select a BIG_PRIME that's just below maxint / 2.

Assuming an int to be 4 bytes, maxint = 2147483647, so the biggest prime < maxint/2 is 1073741789:

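// Order-independent hash: sequences with equal sums get equal hashes.
// Assumes non-negative elements.
int hash(int[] seq) {
  const int BIG_PRIME = 1073741789; // biggest prime below maxint / 2
  int h = 0;
  for (int i = 0; i < seq.Length; i++) {
    // Both summands are below BIG_PRIME, so their sum still fits in an int.
    h = (h + seq[i] % BIG_PRIME) % BIG_PRIME;
  }
  return h;
}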

Since both summands are always below maxint/2 at every step, you won't get any overflows.
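
Since the question uses JavaScript, here is a minimal sketch of the same idea there. It assumes every element is a non-negative safe integer (at most `Number.MAX_SAFE_INTEGER`) and reuses the `BIG_PRIME` value from the question; the name `hashSum` is only illustrative.

```javascript
// Sketch: order-independent hash that is equal for sequences with equal sums.
// Assumption: every element is a non-negative safe integer.
const BIG_PRIME = 999999999989; // value taken from the question

function hashSum(seq) {
  let h = 0;
  for (const value of seq) {
    // Both summands are below BIG_PRIME, so the intermediate sum stays
    // below 2 * BIG_PRIME, well inside the safe-integer range.
    h = (h + (value % BIG_PRIME)) % BIG_PRIME;
  }
  return h;
}

console.log(hashSum([5, 2, 7]));    // 14
console.log(hashSum([3, 4, 5, 2])); // 14
console.log(hashSum([13, 1]));      // 14
```

A modulus closer to `Number.MAX_SAFE_INTEGER / 2` should work the same way and collide at larger values only.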

Edit

From a mathematical point of view, the following property, which may be important for your use case, holds:

(a + b + c + ...) % N  == (a % N + b % N + c % N + ...) % N
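
This identity can be checked with arbitrary-precision arithmetic. The sketch below uses JavaScript `BigInt` purely as a reference (it computes the exact sum, which the hash itself avoids); `exactSumMod` and `runningSumMod` are illustrative names, and the test sequences are the ones from the comments.

```javascript
const N = 1073741789n;

// Left-hand side: exact sum first, then a single reduction.
function exactSumMod(seq) {
  return seq.reduce((acc, x) => acc + x, 0n) % N;
}

// Right-hand side: reduce every term, and also every partial sum,
// which is what keeps fixed-width integers from overflowing.
function runningSumMod(seq) {
  return seq.reduce((acc, x) => (acc + (x % N)) % N, 0n);
}

const a = [1073741789n, 1073741789n, 73741789n, 2073741790n];
const b = [1073741789n, 73741789n, 2073741790n, 1073741789n]; // same terms, different order

console.log(exactSumMod(a), runningSumMod(a), runningSumMod(b)); // 1n 1n 1n
```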

But yeah, of course, as with every hash function, you will have collisions. You can't have a hash function without collisions, because the size of the domain of the hash function (i.e. the number of possible input values) is generally much bigger than the size of the codomain (i.e. the number of possible output values).

For your example, the size of the domain is (in principle) infinite, as you can have any count of numbers from 1 to 2000000000 in your sequence. But your codomain is just ~2000000000 elements (i.e. the range of int).
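
A toy illustration of that counting argument, using a deliberately tiny modulus (`MODULUS = 7` is an arbitrary choice for this example): a thousand distinct one-element inputs can only land on seven distinct hash values.

```javascript
const MODULUS = 7; // tiny on purpose, so collisions are easy to see

const seen = new Set();
for (let x = 0; x < 1000; x++) {
  seen.add(x % MODULUS); // hash of the one-element sequence [x]
}

console.log(seen.size); // 7
```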

derpirscher
  • I like this idea, thank you! – tonix Oct 28 '21 at 11:59
  • How does this algorithm work when all values of `seq` are `> BIG_PRIME`? It basically sums the remainders of the division by `BIG_PRIME`, e.g. `seq = [BIG_PRIME + 2, BIG_PRIME + 7, BIG_PRIME + 3]` so I guess that as long as the sum of these remainders is the same, two sequences will have the same hash, e.g. `hash([BIG_PRIME + 2, BIG_PRIME + 7, BIG_PRIME + 3]) = hash([BIG_PRIME + 12]) = hash([BIG_PRIME + 7, BIG_PRIME + 5])`. But how can I make `hash([7, 5])` produce a different hash? Thank you! – tonix Oct 28 '21 at 15:58
  • To generalize, for any multiple `n : {0, 1, 2, 3, 4, 5, 6, ...}` of `BIG_PRIME`, `hash([n * BIG_PRIME + 2, n * BIG_PRIME + 7, n * BIG_PRIME + 3])` will lead to the same hash as `hash([2, 7, 3])`, but I would like sequence `[2, 7, 3]` to have a different hash than e.g. `hash([BIG_PRIME + 2, BIG_PRIME + 7, BIG_PRIME + 3])` or `hash([2 * BIG_PRIME + 2, 2 * BIG_PRIME + 7, 2 * BIG_PRIME + 3])` or `hash([3 * BIG_PRIME + 2, 3 * BIG_PRIME + 7, 3 * BIG_PRIME + 3])` and so on... – tonix Oct 28 '21 at 16:24
  • That's not possible with this approach. For that you'll need something else. But that will, like any other hash function, also have collisions. – derpirscher Oct 28 '21 at 19:10
  • What do you think about my EDIT? It seems to work well – tonix Oct 28 '21 at 21:28
  • @tonix Actually, there is no difference at all to my approach. You are just using a bigger datatype and therefore pushing the limit where you get collisions a bit higher ... – derpirscher Oct 29 '21 at 06:58
  • Using modulo doesn't lead to a great distribution of hash values unless the original values already have a great distribution, which is bad if the hash function is used for a hash map. – Jimmy T. Sep 02 '22 at 08:59
  • @JimmyT. you are right. But that wasn't the point of the question ... – derpirscher Sep 02 '22 at 11:29
  • I've realized that your mathematical proof doesn't match the code, where you always use modulo on partial sums, making the order relevant. Maybe I did something wrong, but I found two cases with the same numbers (so the same sum) in a different order: `hash([1073741789,1073741789,73741789,2073741790]) = 1073741790` `hash([1073741789,73741789,2073741790,1073741789]) = 1` – Jimmy T. Sep 03 '22 at 15:07
  • @JimmyT. you are right. I forgot a final `% BIG_PRIME` ... – derpirscher Sep 03 '22 at 15:17