Shrink string encoding algorithm

Question

How do we shrink/encode a 20-letter string to 6 letters? I found a few algorithms that address data compression, like RLE, arithmetic coding, and universal codes, but none of them guarantees 6 letters.

The original string can contain the characters A-Z (upper case), 0-9 and a dash.

If you want lossless encoding, it's impossible. There are 20^128 possible ASCII strings of length 20, and only 6^128 strings of length 6. There's no way you can cram the first category into the second. — Kevin, Dec 24 '13 at 18:08
It's not possible to guarantee this. You can only compress strings that have some kind of repetition that can be encoded. — Barmar, Dec 24 '13 at 18:08
@Kevin You got the formulas backward. It's 128^20 and 128^6. — Barmar, Dec 24 '13 at 18:09
Oops, did I? Well, even so, the first number is bigger than the second, so my original point is still valid. Recommended reading: [pigeonhole principle](http://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_applications), in particular the bit that says, "any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger." — Kevin, Dec 24 '13 at 18:11
If you allow any 20-character string, then the other comments are correct; it's impossible. But if you can assume that the 20 input characters are all ASCII letters, with no control characters, you can play some tricks. You can subtract 0x40 from the value, so that the entire alphabet, upper and lower, fits into the range 0x01 to 0x3A. If you assume your input is only uppercase ASCII letters, that gives you even more bits to play with. Then treat the whole input as a series of bits and compress it down. The output chars will be in the range 0x01 to 0xFF and won't necessarily be printable. — shoover, Dec 24 '13 at 18:20
Thanks for all comments, the allowed characters are A-Z, 0-9 and - Ex: 283F-233012931 — Ramu, Dec 24 '13 at 18:38
@Ramu, "the allowed characters" are input characters, output characters, or both? — Daniel, Dec 24 '13 at 18:39
Hi Daniel, I didn't understand your question. The input is a string and an encoded string. I have to encode the given string and compare it with the given encoded string for validation. A one-way hash is also fine. — Ramu, Dec 24 '13 at 18:42
Oh! Using a one-way hash makes it much simpler to get within the boundaries. — shoover, Dec 24 '13 at 18:44
@smk the restrictions are 1. All capital letters 2. All numbers 3. Can contain - (dash) — Ramu, Dec 24 '13 at 18:46
Initialize hash to 0. `For each char in input: hash = ((hash * 101) + char) mod 1000000 ` When you're done, you'll have a hash between 0 and 999999, because of the mod. Print that as a string, and that's your output. — shoover, Dec 24 '13 at 18:49
@shoover I'm pretty sure OP wants something reversible (the usual meaning of the term "encode"), which is rather distinct from this sort of hash function... — twalberg, Dec 24 '13 at 18:54
@twalberg one way hashing is also okay, we don't need to decode string to original string. I'm trying as shoover suggested. — Ramu, Dec 24 '13 at 18:59
@JimMischel OP didn't say whether collisions were allowed. OP is giving us information in dribs and drabs. — shoover, Dec 24 '13 at 20:24
Can the string be *any combination* of those 20 characters? Or is there some format that says a dash can only occur in particular places, some positions must contain numbers, etc? Also, are the strings known to you in advance? Without some restriction on the strings, what you ask is not possible. You're trying to stuff 105 bits worth of information into 48 bits. — Jim Mischel, Dec 24 '13 at 20:25

Timothy · Accepted Answer · 2013-12-24T23:49:09.093

If your goal is to losslessly compress or hash a random input string of 20 characters (each character can be [A-Z], [0-9] or -) to an output string of 6 characters, it is theoretically impossible.

In information theory, given a discrete random variable $X$ that takes values in $\{x_1, \dots, x_n\}$, the Shannon entropy $H(X)$ is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where $p(x_i)$ is the probability that $X = x_i$. In your case, $X$ is a string of 20 characters, each drawn from 37 possible characters, so $n = 37^{20}$. Supposing all 37 characters are equally likely at every position (that is, the input string is random), then $p(x_i) = 1/37^{20}$ for every $i$. So the Shannon entropy of the input is:

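$$H(X) = -\sum_{i=1}^{37^{20}} \frac{1}{37^{20}} \log_2 \frac{1}{37^{20}} = 20 \log_2 37 \approx 104.19 \text{ bits}$$

A char on a common computer holds 8 bits, so 6 chars hold 48 bits. There is no way to fit about 104 bits of information into 6 chars. You need at least 14 8-bit chars to hold it (15 if each char is limited to the 128 ASCII values).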
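The arithmetic behind this bound can be checked with a short Python sketch. The names (`min_output_chars`, `ALPHABET_SIZE`, and so on) are made up for illustration, and the 37-symbol alphabet and 20-character length are taken from the question. The result depends on how many distinct values one output char may take, so both 256 (a full 8-bit char) and 128 (7-bit ASCII) are shown.

```python
import math

ALPHABET_SIZE = 37   # A-Z, 0-9 and '-' (from the question)
INPUT_LENGTH = 20
OUTPUT_LENGTH = 6


def min_output_chars(n_inputs: int, symbols_per_char: int) -> int:
    """Smallest k such that symbols_per_char**k >= n_inputs (exact integer math)."""
    k = 0
    capacity = 1
    while capacity < n_inputs:
        capacity *= symbols_per_char
        k += 1
    return k


n_inputs = ALPHABET_SIZE ** INPUT_LENGTH                  # number of distinct inputs
entropy_bits = INPUT_LENGTH * math.log2(ALPHABET_SIZE)    # bits needed for a uniform input
capacity_bits = OUTPUT_LENGTH * 8                         # bits available in 6 8-bit chars
chars_8bit = min_output_chars(n_inputs, 256)
chars_7bit = min_output_chars(n_inputs, 128)

print(f"input entropy:            {entropy_bits:.2f} bits")
print(f"capacity of 6 chars:      {capacity_bits} bits")
print(f"minimum 8-bit chars:      {chars_8bit}")
print(f"minimum 7-bit ASCII chars: {chars_7bit}")
```

Either way the minimum is more than twice the 6 characters asked for, so no lossless scheme can work for arbitrary inputs.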

If you do allow loss and have to hash the 20 chars into 6 chars, then you are trying to map 37^20 values onto 128^6 keys. It can be done, but you will get plenty of hash collisions.

In your case, supposing the hash is perfectly uniform (anything less uniform would be worse), each input value would share its hash key with, on average, about 5.26 × 10^18 other input values. By a birthday attack, we could expect to find a collision within roughly 2.6 million trials (about the square root of the 128^6 ≈ 4.4 × 10^12 keys), which takes well under 10 seconds on a common laptop. So I don't think this would be a safe hash.
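A quick way to check these collision figures is the Python sketch below. It assumes an ideal, perfectly uniform hash onto 128^6 keys, and it uses the common sqrt(pi/2 * N) approximation for the expected number of random trials before the first collision; that choice of estimate is an assumption, not something stated in the answer.

```python
import math

n_inputs = 37 ** 20   # possible 20-char strings over A-Z, 0-9, '-'
n_keys = 128 ** 6     # possible 6-char outputs with 128 values per char

# With a uniform hash, every key is shared by this many inputs on average.
avg_inputs_per_key = n_inputs / n_keys

# Birthday estimate: expected random trials until two inputs share a key.
birthday_trials = math.sqrt(math.pi / 2 * n_keys)

print(f"keys:                   {n_keys:.3e}")
print(f"average inputs per key: {avg_inputs_per_key:.3e}")
print(f"expected trials to a collision: {birthday_trials:.3e}")
```

This gives on the order of 10^18 inputs per key and a few million trials to a first collision.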

However, if you insist on doing that, you might want to read Hash function algorithms. It lists many algorithms to choose from. Good luck!
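As one concrete sketch of such a lossy mapping, here is the rolling hash shoover suggested in the comments on the question (multiply by 101, add the character code, reduce mod 1000000), written in Python. The function names, the zero-padding to exactly 6 digits, and the random collision search are assumptions added for illustration; the collision search only demonstrates how quickly two different inputs end up with the same 6-character output.

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits + "-"  # the 37 allowed characters


def short_hash(text: str) -> str:
    """Polynomial hash from the comments: h = (h * 101 + char) mod 1000000."""
    h = 0
    for ch in text:
        h = (h * 101 + ord(ch)) % 1_000_000
    # Zero-pad so the output is always 6 characters (an assumption of this sketch).
    return f"{h:06d}"


def is_valid(text: str, expected: str) -> bool:
    """Validate a string against a previously computed 6-character hash."""
    return short_hash(text) == expected


def find_collision(seed: int = 0, max_trials: int = 100_000):
    """Draw random 20-char strings until two different ones share a hash."""
    rng = random.Random(seed)
    seen = {}  # hash -> first string that produced it
    for trial in range(1, max_trials + 1):
        candidate = "".join(rng.choice(ALPHABET) for _ in range(20))
        digest = short_hash(candidate)
        other = seen.get(digest)
        if other is not None and other != candidate:
            return other, candidate, trial
        seen[digest] = candidate
    return None


if __name__ == "__main__":
    code = short_hash("283F-233012931")  # example string from the comments
    print(code, is_valid("283F-233012931", code))
    print(find_collision())
```

With only 10^6 possible outputs, a collision typically shows up after a few thousand random strings, so a hash like this can catch accidental typos during validation but should not be relied on against deliberate forgery.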

Did OP state a requirement for losslessness? — shoover, Dec 24 '13 at 20:24
@shoover I am editing to consider the case of loss:) — Timothy, Dec 24 '13 at 20:34
