
I am brainstorming for a project which will store large chunks of coordinate data (latitude, longitude) in a database. Key aspects of this data will be calculated and stored, and then the bulk of the data will be compressed and stored. I am looking for a lossless compression algorithm to reduce the storage space of this data. Is there a (preferably common) algorithm which is good at compressing this type of data?

Known attributes of the data

  • The coordinate pairs are ordered and that order should be preserved.
  • All numbers will be limited to 5 decimal places (roughly 1m accuracy).
  • The coordinate pairs represent a path, and adjacent pairs will likely be relatively close to each other in value.

Example Data

[[0.12345, 34.56789], [0.01234, 34.56754], [-0.00012, 34.56784], …]

Note: I am not too concerned about language at this time, but I will potentially implement this in JavaScript and PHP.

Thanks in advance!

Nate

3 Answers


To expand on the delta encoding suggested by barak manos, you should start by encoding the coordinates as binary numbers instead of strings. Use four-byte signed integers, each equal to 10^5 times your values.

Then apply delta encoding, where the previous latitude and longitude are subtracted from the current ones, respectively. The first lat/long pair is left as is.

Now break the data into four planes, one for each of the four bytes in the 32-bit integers. The higher bytes will be mostly zeros, with all of the entropy in the lower bytes. You can break the data into blocks, so that your planes don't have to span the entire data set.

Then apply zlib or lzma compression.
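
A minimal sketch of this pipeline in Node.js, assuming the built-in zlib module for the final step (the function name and buffer layout here are illustrative, not a fixed format):

const zlib = require('zlib');

// Scale to 5-decimal fixed point, delta-encode latitudes and longitudes
// separately, split the 32-bit deltas into four byte planes, then deflate.
function compressPath(coords) {
  const n = coords.length;
  const deltas = new Int32Array(n * 2);
  let prevLat = 0, prevLon = 0;
  for (let i = 0; i < n; i++) {
    const lat = Math.round(coords[i][0] * 1e5);
    const lon = Math.round(coords[i][1] * 1e5);
    deltas[2 * i]     = lat - prevLat;  // first pair is stored as is
    deltas[2 * i + 1] = lon - prevLon;
    prevLat = lat;
    prevLon = lon;
  }

  // Plane 0 holds the low byte of every delta, plane 3 the high byte;
  // after delta encoding the high planes are mostly zeros.
  const planes = Buffer.alloc(n * 2 * 4);
  for (let i = 0; i < n * 2; i++) {
    for (let b = 0; b < 4; b++) {
      planes[b * n * 2 + i] = (deltas[i] >>> (8 * b)) & 0xff;
    }
  }
  return zlib.deflateSync(planes);
}

Decompression reverses each step: inflate, reassemble the 32-bit values from the four planes, and take a running sum to undo the deltas.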

Mark Adler
  • Mark, did you suggest four-byte signed integers simply to be rid of the decimals? And why should I break the data into four planes? So that I can take advantage of run-length encoding on the higher bytes? – Nate Jan 04 '14 at 14:50
  • Both to get rid of the ASCII text, and then, yes, to avoid a floating point representation, which is not needed here. – Mark Adler Jan 04 '14 at 16:00
  • 2
    The four planes is to get better compression on the more significant planes. If the high bytes and low bytes are mixed, as there are normally in a series of integers, then its harder for the compressor to take advantage of the low entropy. – Mark Adler Jan 04 '14 at 16:01
  • You could go further and break it into 32 planes of bits instead of bytes. You could then immediately drop the top six or seven planes (longs and lats respectively), since they will always be zero. – Mark Adler Jan 04 '14 at 19:28

I would recommend that you first exploit the fact that adjacent symbols are similar, and convert your data in order to reduce the entropy. Then, apply the compression algorithm of your choice on the output.

Let IN_ARR be the original array and OUT_ARR be the converted array (input for compression):

OUT_ARR[0] = IN_ARR[0]
for i = 1 to N-1
    OUT_ARR[i] = IN_ARR[i] - IN_ARR[i-1]

For simplicity, the pseudo-code above is written for one-dimensional coordinates.

But of course, you can easily implement it for two-dimensional coordinate pairs, as shown in the sketch below...

And of course, you will have to apply the inverse operation after decompression:

IN_ARR[0] = OUT_ARR[0]
for i = 1 to N-1
    IN_ARR[i] = OUT_ARR[i] + IN_ARR[i-1]
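
For example, assuming the values have already been converted to fixed-point integers (so the arithmetic is exact), a JavaScript version for coordinate pairs might look like this (deltaEncode and deltaDecode are illustrative names):

// Delta-encode an ordered array of [lat, lon] pairs.
function deltaEncode(pairs) {
  const out = [pairs[0].slice()];  // first pair is kept as is
  for (let i = 1; i < pairs.length; i++) {
    out.push([pairs[i][0] - pairs[i - 1][0],
              pairs[i][1] - pairs[i - 1][1]]);
  }
  return out;
}

// Inverse operation, applied after decompression.
function deltaDecode(deltas) {
  const out = [deltas[0].slice()];
  for (let i = 1; i < deltas.length; i++) {
    out.push([deltas[i][0] + out[i - 1][0],
              deltas[i][1] + out[i - 1][1]]);
  }
  return out;
}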
barak manos
  • Barak, does this technique have a name? – Nate Jan 04 '14 at 14:41
  • @Nate: Basic video compression takes a set of consecutive frames (images) and divides it into the first frame (called INTRA) and the remaining frames (called INTER). The INTRA frame is compressed as is, but for each INTER frame, the algorithm first computes the difference from the previous frame (which mostly consists of zero or close-to-zero values), and then compresses that difference instead of the frame itself. Since the entropy (diversity) of the difference is much lower, compression rate is potentially much higher. – barak manos Jan 04 '14 at 14:52
  • @Nate: And of course, when the difference becomes too high (for example, when a new scene in the movie begins), the algorithm starts a new set of INTRA/INTER. Not sure about the official name for this method, but my suggestion above uses the same principle. – barak manos Jan 04 '14 at 14:54

Here is a way to structure your data efficiently to get the most out of it:

  1. First divide your data into two sets: the integer parts and the decimal parts:

    e.g. [1.23467, 2.45678] => [1,2] and [23467,45678] => [1],[2],[23467],[45678]

  2. As your data seems random, the first thing you can do for compression is not to store it directly as strings but to use the following fixed-width binary encoding.

  3. The range of the latitude integer part is -90 to +90, i.e. 181 values, so it needs ceil(log2(181)) = 8 bits.

  4. The range of the longitude integer part is -180 to +180, i.e. 361 values, so it needs ceil(log2(361)) = 9 bits.

  5. The decimal parts have 5 digits (0 to 99999), so each needs ceil(log2(10^5)) = 17 bits.

  6. Using the above encoding, you will need 8 + 9 + 17*2 = 51 bits per record, whereas storing the digits as strings would take up to 2 + 3 + 5*2 = 15 bytes per record (ignoring signs and punctuation).

  7. Compression ratio = 51/(15*8) ≈ 42% when compared with the string data size.

  8. Compression ratio = 51/(2*32) ≈ 80% when compared with the size of two 32-bit floats.

  9. Group similar parts of the path into 4 groups, for example:

[[0.12345,34.56789],[0.01234,34.56754],[-0.00012,34.56784]...]

=> [0,0,-0],[34,34,34],[12345,1234,12],[56789,56754,56784]

Use delta encoding on the individual groups and then apply Huffman coding to get further compression on the total data; a sketch of the grouping step follows below.
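
A minimal JavaScript sketch of the splitting and grouping steps (groupPath is an illustrative name, and the fixed-point rounding is an assumption):

// Split each coordinate into integer and decimal parts and collect them
// into the four groups described above.
function groupPath(coords) {
  const latInt = [], lonInt = [], latDec = [], lonDec = [];
  for (const [lat, lon] of coords) {
    const latFix = Math.round(lat * 1e5);   // 5-decimal fixed point
    const lonFix = Math.round(lon * 1e5);
    latInt.push(Math.trunc(latFix / 1e5));
    lonInt.push(Math.trunc(lonFix / 1e5));
    latDec.push(Math.abs(latFix % 1e5));    // decimal part as 0..99999
    lonDec.push(Math.abs(lonFix % 1e5));
  }
  // Note: values in (-1, 0) lose their sign in the integer part (the "-0"
  // in the example above), so a real implementation would need to store
  // that sign bit separately.
  return [latInt, lonInt, latDec, lonDec];
}

Delta encoding and a Huffman coder can then be run over each group separately.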

Vikram Bhat