4

Can anyone suggest compression algorithms to operate on numeric strings of 20-30 digits ?

Vijay Dev
  • 26,966
  • 21
  • 76
  • 96

7 Answers7

8

You can easily compress 30 character string down to 15 bytes by just using binary representations of each digit. For example, 1592 can be represented as a series of four-bit values as such:

0001 0101 1001 0010

This, when grouped in groups of two four-bit values, can be represented as §Т in standard ASCII.

Further, if your strings contain many identical consecutive digits, you can implement a variation of Run-Length Encoding.

Anton Gogolev
  • 113,561
  • 39
  • 200
  • 288
  • 1
    Watch out if you're using C strings. The number 0 can be a problem since C strings are null-terminated. – luiscubal Feb 15 '10 at 16:32
  • This isn't as efficient as representing the entire string as a big integer. 4 bits can represent 16 different values, so this representation wastes 6/16 = 37.5% of the allocated space. – David Gladfelter Feb 15 '10 at 16:50
  • +1. Binary Coded Decimal (BCD), while not optimum, is a good first solution. It reduces storage cost by 50% and adds very little runtime overhead. – Jim Mischel Feb 15 '10 at 17:45
3

Assuming you can have floating point numbers, you have a possibility of 11 symbols:

[0,1,2,3,4,5,6,7,8,9, .]

This means that you need 4 bits per symbol. 3 bits can only represent a maximum of 8 symbols. You can easily use 4 bits per each symbol and get a lot of compression.

If you only have integer digits in your string, an easy solution is to convert to hexidecimal and you can use 4 bits per symbol still while getting a better compression ratio. (since there are no wasted bits with 16 symbols)

If you use Huffman compression you will get an optimal bits/per symbol ratio. You can read more about Huffman compression here.

Brian R. Bondy
  • 339,232
  • 124
  • 596
  • 636
  • chose to go with this approach. Thanks Brian! – Vijay Dev Feb 16 '10 at 20:34
  • huffman compression is silly in this case because it is almost impossible to come up with a set of numbers that have non-near maximal entropy with respect to digits. – twolfe18 Feb 17 '10 at 03:17
  • It's a better suggestion than using delta encoding, there is no proof that delta encoding will help him at all with the info the OP gave. – Brian R. Bondy Feb 17 '10 at 14:21
  • @BrianR.Bondy link not working on the redirection to `www.`... http://www.brianbondy.com/compression/huffman/ – I.G. Pascual Jan 30 '12 at 13:24
2

Make it 2 15 digit numbers and convert them to 2 64 bit integers? Or are they floats?

extraneon
  • 23,575
  • 2
  • 47
  • 51
2

Break it up into a couple of unsigned ints?

"9347692367596047327509604839"

becomes:

9 347692367 596047327 509604839

luke
  • 36,103
  • 8
  • 58
  • 81
1

One obvious solution is to "compress" them as a binary numeric representation rather than a string representation. See this stack overflow question for example libraries.

Community
  • 1
  • 1
David Gladfelter
  • 4,175
  • 2
  • 25
  • 25
1

I would definitely go for the easiest solution, and just store them as integers (of a suitable size, be it 32-bit, 64-bit or 128 bit, depending on needs). Compressing it with an algorithm supporting characters would waste a lot of space, since it would have to cater for a lot more than 10 different values (0-9) per character .

Erik A. Brandstadmoen
  • 10,430
  • 2
  • 37
  • 55
  • A nice idea, but 20 digits won't fit into a 64-bit integer. And although writing extended integer parsing and such isn't rocket science, it's far from trivial. In addition, such conversion would eliminate any leading zeros in the numeric string. – Jim Mischel Feb 15 '10 at 17:41
  • Leading zeros would have to be accounted for especially, of course, but it wasn't clear whether the string could contain and should preserve leading zeros. Observe, however, that I suggest a "suitable size", which would imply "something big enough to hold your data" :). The argument remains. Just use as many n-bit units you need to store the value. 1) Store any leading zeros (if you wish to keep them) 2) Read as many characters that you can store in a unit/word (32-bit,64-bit, whatever), and store them in a unit 3) Repeat 2) until End of stream. 4) Wrap the units/words in something readable – Erik A. Brandstadmoen Mar 01 '10 at 16:05
1

one of the most common ways to compress numbers (assuming you have more than one you want to compress -- its kind of hard to compress one thing), is using delta encoding. It works on the principle that if you know the first number is x, and the numbers after it are relatively similar, you can encode the subsequent numbers as (x+c1), (x+c2), etc.

In this scheme, you only have to encode the full x value once, and if your c values are smaller than your x's, then you can save a lot of space. You can also use a version of this that sorts the numbers first, and then your delta refers to the number last seen instead of one number. With this method you can cover a wider range of numbers more efficiently.

Svante
  • 50,694
  • 11
  • 78
  • 122
twolfe18
  • 2,228
  • 4
  • 24
  • 25