Compression of numeric strings

Question

Can anyone suggest compression algorithms to operate on numeric strings of 20-30 digits ?

score 8 · Answer 1 · answered Feb 15 '10 at 16:16

8

You can easily compress 30 character string down to 15 bytes by just using binary representations of each digit. For example, 1592 can be represented as a series of four-bit values as such:

0001 0101 1001 0010

This, when grouped in groups of two four-bit values, can be represented as §Т in standard ASCII.

Further, if your strings contain many identical consecutive digits, you can implement a variation of Run-Length Encoding.

answered Feb 15 '10 at 16:16

Anton Gogolev

113,561
39
200
288

1

Watch out if you're using C strings. The number 0 can be a problem since C strings are null-terminated. – luiscubal Feb 15 '10 at 16:32
This isn't as efficient as representing the entire string as a big integer. 4 bits can represent 16 different values, so this representation wastes 6/16 = 37.5% of the allocated space. – David Gladfelter Feb 15 '10 at 16:50
+1. Binary Coded Decimal (BCD), while not optimum, is a good first solution. It reduces storage cost by 50% and adds very little runtime overhead. – Jim Mischel Feb 15 '10 at 17:45

Brian R. Bondy · Accepted Answer · 2012-01-30T23:09:09.517

3

Assuming you can have floating point numbers, you have a possibility of 11 symbols:

[0,1,2,3,4,5,6,7,8,9, .]

This means that you need 4 bits per symbol. 3 bits can only represent a maximum of 8 symbols. You can easily use 4 bits per each symbol and get a lot of compression.

If you only have integer digits in your string, an easy solution is to convert to hexidecimal and you can use 4 bits per symbol still while getting a better compression ratio. (since there are no wasted bits with 16 symbols)

If you use Huffman compression you will get an optimal bits/per symbol ratio. You can read more about Huffman compression here.

edited Jan 30 '12 at 23:09

answered Feb 15 '10 at 16:24

Brian R. Bondy

339,232
124
596
636

chose to go with this approach. Thanks Brian! – Vijay Dev Feb 16 '10 at 20:34
huffman compression is silly in this case because it is almost impossible to come up with a set of numbers that have non-near maximal entropy with respect to digits. – twolfe18 Feb 17 '10 at 03:17
It's a better suggestion than using delta encoding, there is no proof that delta encoding will help him at all with the info the OP gave. – Brian R. Bondy Feb 17 '10 at 14:21
@BrianR.Bondy link not working on the redirection to `www.`... http://www.brianbondy.com/compression/huffman/ – I.G. Pascual Jan 30 '12 at 13:24

score 2 · Answer 3 · answered Feb 15 '10 at 16:16

2

Make it 2 15 digit numbers and convert them to 2 64 bit integers? Or are they floats?

answered Feb 15 '10 at 16:16

extraneon

23,575
2
47
51

No they are not floats. Can you explain the answer a bit more? Thanks. – Vijay Dev Feb 15 '10 at 16:19

score 2 · Answer 4 · answered Feb 15 '10 at 16:18

2

Break it up into a couple of unsigned ints?

"9347692367596047327509604839"

becomes:

9 347692367 596047327 509604839

answered Feb 15 '10 at 16:18

luke

36,103
8
58
81

The risk there is that if the left-most sub-sequence of the string is one or more zeroes, those get lost in your representation. – David Gladfelter Feb 15 '10 at 16:40

score 1 · Answer 5 · edited May 23 '17 at 12:06

1

One obvious solution is to "compress" them as a binary numeric representation rather than a string representation. See this stack overflow question for example libraries.

edited May 23 '17 at 12:06

Community

1
1

answered Feb 15 '10 at 16:17

David Gladfelter

4,175
2
25
25

score 1 · Answer 6 · answered Feb 15 '10 at 16:21

1

I would definitely go for the easiest solution, and just store them as integers (of a suitable size, be it 32-bit, 64-bit or 128 bit, depending on needs). Compressing it with an algorithm supporting characters would waste a lot of space, since it would have to cater for a lot more than 10 different values (0-9) per character .

answered Feb 15 '10 at 16:21

Erik A. Brandstadmoen

10,430
2
37
55

A nice idea, but 20 digits won't fit into a 64-bit integer. And although writing extended integer parsing and such isn't rocket science, it's far from trivial. In addition, such conversion would eliminate any leading zeros in the numeric string. – Jim Mischel Feb 15 '10 at 17:41
Leading zeros would have to be accounted for especially, of course, but it wasn't clear whether the string could contain and should preserve leading zeros. Observe, however, that I suggest a "suitable size", which would imply "something big enough to hold your data" :). The argument remains. Just use as many n-bit units you need to store the value. 1) Store any leading zeros (if you wish to keep them) 2) Read as many characters that you can store in a unit/word (32-bit,64-bit, whatever), and store them in a unit 3) Repeat 2) until End of stream. 4) Wrap the units/words in something readable – Erik A. Brandstadmoen Mar 01 '10 at 16:05

score 1 · Answer 7 · edited Feb 15 '10 at 17:44

one of the most common ways to compress numbers (assuming you have more than one you want to compress -- its kind of hard to compress one thing), is using delta encoding. It works on the principle that if you know the first number is x, and the numbers after it are relatively similar, you can encode the subsequent numbers as (x+c1), (x+c2), etc.

In this scheme, you only have to encode the full x value once, and if your c values are smaller than your x's, then you can save a lot of space. You can also use a version of this that sorts the numbers first, and then your delta refers to the number last seen instead of one number. With this method you can cover a wider range of numbers more efficiently.

Compression of numeric strings

7 Answers7