I'm looking for a hash function that I can use to give uniform unique IDs to devices that connect to our network either using a GSM modem or an ethernet connection.

So for any given device I have either an IMEI number or a MAC address hard-coded that I can use to generate the hash.

I've been researching hash functions for the last few hours, reading up on the different non-cryptographic and cryptographic hashes that I might want to use. I care more about a low collision rate than about performance, as the hash will not be calculated very often.

My front-runners are MD5, FNV-1a, MurmurHash2, Hsieh, and DJB.

Whatever hash I use will have to be implemented in C and will be used on a microcontroller with a tiny processor.

I know that the trick to choosing a good hash function for your needs is knowing what sort of input you're going to be feeding it.

The reason I'm asking this question is that the idea popped into my head that both IMEI and MAC have finite lengths and ranges, so perhaps there exists a fairly simple hash function that can cover the full sets of both and not have collisions. (Thus, a perfect hash function)

An IMEI number is 15 decimal digits long (12-13 bytes in hex?), and a MAC address is 6 bytes. Mulling it over, I don't think you would have collisions between the two sets of input numbers, but feel free to correct me if that is wrong. If you did, could you do something to prevent it? Add some seed to one of the sets?

Am I on the right track? Is finding a perfect hash function for these combined sets possible?

Thanks!

Update

Thanks for the answers and comments. I ended up using the identity function ;) as my hash function, and then also using a mask since there is potential overlap across the sets of numbers.

IMEI, IMEISV, and MAC will all fit in fewer than 7 bytes (the largest, a 16-digit IMEISV, needs 54 bits), so I am storing my values in 7 bytes and then doing a bitwise OR on the first byte with a mask based on which set the number comes from, to ensure they are unique across all sets.
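The scheme described in the update could look roughly like the sketch below. The names `make_id`, `ID_LEN`, and the three mask values are hypothetical and not taken from the actual implementation. It assumes the identifier has already been converted to a number below 2^54 (a 16-digit IMEISV is below that bound), which leaves the top two bits of the first byte free for the mask.

```c
#include <stdint.h>

enum { ID_LEN = 7 };  /* fixed identifier length in bytes */

/* Hypothetical masks: they occupy only the top two bits of byte 0,
 * which no value below 2^54 can touch. */
#define ID_MASK_IMEI   0x00u
#define ID_MASK_IMEISV 0x40u
#define ID_MASK_MAC    0x80u

/* Store `value` big-endian in 7 bytes, then OR the set mask into the
 * first byte. Assumes value < 2^54; larger values would clobber the mask. */
void make_id(uint8_t out[ID_LEN], uint64_t value, uint8_t mask)
{
    for (int i = ID_LEN - 1; i >= 0; i--) {
        out[i] = (uint8_t)(value & 0xFFu);
        value >>= 8;
    }
    out[0] |= mask;
}
```

Because the mask bits never overlap the payload, the original number can be recovered by clearing the top two bits of the first byte, so the mapping stays collision-free across all three sets.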

Grekker
  • And yes, before someone says it, I considered just using the numbers themselves as the identifiers :) I'd prefer an identifier with a uniform length. – Grekker Aug 16 '11 at 15:33
  • That's a really specific case -- have you done any analysis using existing commonly-used hash functions (e.g. SHA1, MD5) to see their incidence of collisions? – Joe Aug 16 '11 at 15:35
  • MD5, for instance, gives you a 128-bit hash, which is longer than the IMEI (~48 bits) and the MAC address (48 bits) combined. So why not just use the original values? – Oliver Charlesworth Aug 16 '11 at 15:38
  • @Oli, see my comment above :) But yes, that is the simplest approach. – Grekker Aug 16 '11 at 15:41
  • @Grekker: This *will* have uniform length. You can represent your IMEI with 6 bytes (or 8 bytes if you want to use a standard type), and your MAC address with 6 bytes. – Oliver Charlesworth Aug 16 '11 at 15:42
  • @Joe, how might I go about doing that? – Grekker Aug 16 '11 at 15:43
  • @Oli, you are right sir. Made a fundamental mistake thinking 15 decimal digits = 12-13 bytes. It's half that. – Grekker Aug 16 '11 at 15:47

1 Answer

There's no way to make a perfect hash over an unknown, growing input set. You could simply make the field one bit larger than whichever of IMEI or MAC is larger, and use that bit to flag which type of identifier it is, along with the entire IMEI/MAC. Anything smaller will have collisions, but they're probably quite rare.
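A minimal sketch of that flag-bit idea in C, assuming the identifier is held in a `uint64_t`. The names `ID_FLAG_MAC`, `id_from_mac`, and `id_from_imei` are made up for illustration. A 15-digit IMEI is below 2^50 and a MAC is below 2^48, so bit 50 is the first bit that neither kind of value can set.

```c
#include <stdint.h>

/* Hypothetical flag: bit 50 is one bit above the largest 15-digit IMEI. */
#define ID_FLAG_MAC ((uint64_t)1 << 50)

/* Pack the 6 MAC bytes (most significant first) and set the flag bit. */
uint64_t id_from_mac(const uint8_t mac[6])
{
    uint64_t v = 0;
    for (int i = 0; i < 6; i++)
        v = (v << 8) | mac[i];
    return v | ID_FLAG_MAC;
}

/* Convert a 15-digit decimal IMEI string to its numeric value, leaving
 * the flag bit clear. Returns 0 on success, -1 on malformed input. */
int id_from_imei(const char *imei, uint64_t *out)
{
    uint64_t v = 0;
    int n = 0;
    for (; imei[n] != '\0'; n++) {
        if (n >= 15 || imei[n] < '0' || imei[n] > '9')
            return -1;
        v = v * 10 + (uint64_t)(imei[n] - '0');
    }
    if (n != 15)
        return -1;
    *out = v;
    return 0;
}
```

Every identifier then fits in 51 bits, and the flag bit alone tells the two sets apart, so no IMEI can ever equal a MAC-derived value.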

R.. GitHub STOP HELPING ICE
  • I agree with your first sentence, but I'm not sure this is an unknown, growing set. Both IMEI and MAC have maximum lengths, so if your hash function produces a digest longer than the maximums you should be able to find a perfect hash function, right? – Grekker Aug 16 '11 at 15:42
  • @grekker: Yes. A trivial example is `f(x) = x`. – Oliver Charlesworth Aug 16 '11 at 15:47
  • Yes, the identity function is a perfect hash function. What's the benefit of using something more complicated? – Keith Thompson Aug 16 '11 at 15:48
  • I meant it's impossible to make a perfect hash *smaller than the input size in bits*, when you don't already have a fixed subset of the possible input values you have to consider. – R.. GitHub STOP HELPING ICE Aug 16 '11 at 15:49
  • Wasn't my answer already advocating using the identity function, crossed with one extra bit to distinguish whether it's an IMEI or MAC? – R.. GitHub STOP HELPING ICE Aug 16 '11 at 15:50
  • Thanks for the informative answers and comments, guys. My question stemmed from a false assumption, which was that IMEI and MAC had very different byte lengths. Since both can fit in 8 bytes, I'll just follow R..'s suggestion and flip a bit to indicate which set the identifier comes from. Thanks! – Grekker Aug 16 '11 at 16:05