
To save space in the executable, I want to compute checksums (or hashes) over ASCII strings and later use a checksum to look up the corresponding string.

This saves space, since I don't have to fill up the executable with ASCII strings; only, say, 32-bit integers are stored instead.

Now, for this idea to work, I need a checksum algorithm that is able to compute unique checksums for strings of up to N characters. Because most of the strings are identifiers, N=20 would be acceptable.

Does anyone know of a checksum algorithm that satisfies my criteria?

Theory: Since a checksum algorithm maps {0,1}^* -> {0,1}^m, an infinite number of collisions exist in general. However, here I consider only strings of up to N characters, that is, inputs of bounded length. For such inputs, collision-free (injective) mappings {0,1}^N -> {0,1}^m are guaranteed to exist whenever N<=m.

Shuzheng
  • If you're increasing the amount of bits what would be the advantage? – hugos May 06 '17 at 18:42
  • Indeed, if I increase `m` then collisions have lower probability, but I need to be certain. A wrong match will make the application crash! – Shuzheng May 06 '17 at 18:43
  • @Shuzheng Your theory is nice and all, but it doesn't fit your use case. If `N<=m`, where exactly do you see that you need *less* storage space? – Artjom B. May 06 '17 at 18:46
  • If you use any secure cryptographic hash function, collisions must be very unlikely. In fact, once a collision is found in a cryptographic hash function, that function is considered insecure. – hugos May 06 '17 at 18:46
  • Take a look at the hashlib module https://docs.python.org/2.7/library/hashlib.html – hugos May 06 '17 at 18:49
  • If you have a fixed, known set of strings, google "perfect hashing", then "minimal perfect hash function" (one which maps N strings to [0..N-1]). Your call whether they qualify as "simple". – Mischa May 06 '17 at 18:47
  • You're making things difficult for yourself. If you want a function that's *guaranteed* to hash any 20-byte sequence to a unique value, you'll have to use a 160-bit hash function, which is hardly practical because (a) each hash value will need 20 bytes of storage, and (b) it would be impossible to build a hash table with 2^160 buckets. I suggest you study the existing techniques for resolving collisions in hash tables (e.g., chaining with linked lists, or open addressing), and choose one of those instead of pursuing your current approach. If you're writing in Python, just use a dictionary. – r3mainer May 06 '17 at 23:50
  • You all seem to be ignoring the fact that only a small subset of byte sequences are ASCII strings consisting of A-Za-z0-9, so we can rule out many candidates. Therefore, the set of, say, 20-byte sequences under consideration is smaller. Whether other byte sequences collide, I don't care. – Shuzheng May 07 '17 at 05:19
  • So, first you would have to encode your strings to byte sequences. If we assume Base62 (A-Za-z0-9), this would mean a reduction from 20 ASCII bytes to roughly 15 binary bytes. If `m` is supposed to be smaller than 15 bytes, then this cannot be answered. Please add all your requirements to your question. – Artjom B. May 07 '17 at 07:26
  • @Shuzheng For a 20 character string consisting of `A-Za-z0-9`, there are `62^20` possible strings. That's much, much more than the measly `2^32` strings you can represent with 32 bits. You need at least 120 bits for a generic guaranteed-collision-less hash since `2^119 < 62^20 < 2^120`, and the simplest conversion for that would be simply changing a base-62 number to a base-2 one. – Bernhard Barker May 07 '17 at 07:57
  • If your strings all follow a specific format (e.g. the first 5 characters are always digits, the 6th to 10th characters are lowercase letters, etc.), that would allow you to represent the string using fewer bits, but without such constraints you're left with needing 120 bits. – Bernhard Barker May 07 '17 at 08:11
  • If the complete set of inputs is known beforehand, you can use perfect hashing. – CodesInChaos May 07 '17 at 13:43
  • What is that? @CodesInChaos – Shuzheng May 07 '17 at 14:49
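
To illustrate the perfect-hashing idea raised in the comments: when the full set of strings is known at build time, one can search for a seed under which a 32-bit hash happens to be collision-free on that particular set. The following is only a minimal sketch under that assumption. The names `hash32`, `build_table` and `IDENTIFIERS` are hypothetical, and truncated BLAKE2b from `hashlib` is just one possible choice of seeded hash. Note that this is a perfect hash for the known set only, not a minimal one, and a string outside the set may still map to a stored code.

```python
import hashlib


def hash32(name: str, seed: int) -> int:
    """Return a 32-bit hash of an ASCII string; the seed is used as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        name.encode("ascii"),
        digest_size=4,                     # 4 bytes = 32 bits
        salt=seed.to_bytes(8, "little"),   # a different seed gives a different hash function
    ).digest()
    return int.from_bytes(digest, "big")


def build_table(names, max_seed=1000):
    """Find a seed for which all known names get distinct 32-bit codes."""
    unique = set(names)
    for seed in range(max_seed):
        table = {hash32(name, seed): name for name in unique}
        if len(table) == len(unique):      # no two names shared a code
            return seed, table
    raise ValueError("no collision-free seed found; try a wider hash")


# Hypothetical identifier set, known completely at build time.
IDENTIFIERS = ["open_file", "close_file", "read_line", "write_line"]
SEED, TABLE = build_table(IDENTIFIERS)

# Only the 32-bit code needs to be stored; the table maps it back to the string.
code = hash32("read_line", SEED)
print(TABLE[code])
```

The uniqueness guarantee here comes from checking the known set explicitly, not from the hash function itself, so the check has to be rerun whenever the set of strings changes.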

1 Answer


If your hashes are 32 bit integers, then you have 2^32 possible hash codes. A 20 character ASCII string has 7 x 20 = 140 bits minimum, 8 x 20 = 160 bits if you are working in bytes. Original ASCII is a 7-bit code, hence the difference.

You cannot fit 140 bits into 32 bits without duplicating some hash values.

A unique checksum for 20-character ASCII strings would need a minimum of 140 bits, or 160 bits if each character is stored in a full byte.
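
The counting argument can be checked with a few lines of arithmetic. The sketch below assumes the restriction mentioned in the comments, namely strings of at most 20 characters drawn from A-Za-z0-9; under that assumption, simply reading the string as a number gives a collision-free code that fits in 120 bits. The names `ALPHABET`, `pack` and `unpack` are hypothetical, and the digits are shifted by one (bijective base 62) so that strings of different lengths such as "0" and "00" do not collide.

```python
import string

# Assumed alphabet: 62 symbols (0-9, A-Z, a-z).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
MAX_LEN = 20


def pack(s: str) -> int:
    """Map a string over ALPHABET (length <= MAX_LEN) to a unique integer."""
    if len(s) > MAX_LEN or any(ch not in ALPHABET for ch in s):
        raise ValueError("expected at most 20 characters from A-Za-z0-9")
    value = 0
    for ch in s:
        # Digits run from 1 to 62, so leading symbols are never lost.
        value = value * 62 + ALPHABET.index(ch) + 1
    return value


def unpack(value: int) -> str:
    """Inverse of pack: recover the original string from its integer code."""
    chars = []
    while value > 0:
        value, digit = divmod(value - 1, 62)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


# 62^20 lies between 2^119 and 2^120, so 120 bits are needed and 32 are far too few.
print((62 ** 20).bit_length())
# Even the largest code (twenty 'z' characters) fits in 120 bits.
print(pack("z" * MAX_LEN).bit_length())
```

Because `unpack` recovers the input exactly, this is really a compact encoding of the string and not a checksum, which is all that a guaranteed collision-free mapping can be.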

rossum