
Anyone have a recommendation on a good compression algorithm that works well with double precision floating point values? We have found that the binary representation of floating point values results in very poor compression rates with common compression programs (e.g. Zip, RAR, 7-Zip etc).

The data we need to compress is a one dimensional array of 8-byte values sorted in monotonically increasing order. The values represent temperatures in Kelvin with a span typically under 100 degrees. The number of values ranges from a few hundred to at most 64K.

Clarifications

  • All values in the array are distinct, though repetition does exist at the byte level due to the way floating point values are represented.

  • A lossless algorithm is desired since this is scientific data. Conversion to a fixed point representation with sufficient precision (~5 decimals) might be acceptable provided there is a significant improvement in storage efficiency.

Update

Found an interesting article on this subject. Not sure how applicable the approach is to my requirements.

https://userweb.cs.txstate.edu/~burtscher/papers/dcc06.pdf

Jill-Jênn Vie
David Taylor
  • A 'lossy' algorithm IS acceptable because your data is not discrete. There is a real maximum physical rate of change and real accuracy of the sensor - so any lossy encoding with sufficient sampling bandwidth is OK. – Martin Beckett Feb 10 '10 at 19:16
  • Martin, thanks for your answer. Technically you are correct, but not all design decisions are based purely on technical considerations. In this case we need to preserve the exact values since they represent the "acceptable" results from another vendor's sampling decisions. – David Taylor Feb 17 '10 at 14:34
  • A current link to the paper: http://cs.txstate.edu/~burtscher/papers/dcc06.pdf – Victor Sergienko Aug 01 '16 at 22:38

7 Answers


First thing to consider: try compressing the data before you convert it to double precision. Re your comment to David Thornley, unless your IR imaging ADCs have 24 significant bits, 32-bit floats should have more than enough precision; it is only your requirement to exactly preserve the noise generated by subsequent processing that is a problem. Failing that, it might conceivably be practical to reverse-engineer your processing by determining a table of values it generates, and storing an index to this table instead.

Second: if your compression algorithm knows that your data is in 8-byte chunks, it will be much more effective; this is because it will not throw your most significant bytes in with the least significant bytes. As a crude preprocessing method, you could try prefixing each double with a distinctive byte (ASCII comma, perhaps?) before piping it through a byte-based compressor like gzip; this should result in better total compression even though the intermediate stream is 12% larger. Less crude but more effort would be to write your own compression adapted to this task -- perhaps using an 8-level tree to represent the expected values of each byte in your double.
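
Purely as an illustration of that marker-byte preprocessing, here is a minimal sketch, assuming Python with zlib (a Deflate implementation standing in for gzip); the function names are my own:

    import struct
    import zlib

    def compress_with_markers(values, marker=b","):
        """Prefix each packed double with a marker byte, then Deflate.

        The marker gives the byte-oriented compressor a 9-byte period to
        latch onto; the intermediate stream grows ~12%, but the final
        output is often smaller than compressing the raw doubles.
        """
        raw = b"".join(marker + struct.pack("<d", v) for v in values)
        return zlib.compress(raw, 9)

    def decompress_with_markers(blob, marker=b","):
        raw = zlib.decompress(blob)
        values = []
        for i in range(0, len(raw), 9):
            chunk = raw[i:i + 9]
            assert chunk[:1] == marker, "stream out of sync"
            values.append(struct.unpack("<d", chunk[1:])[0])
        return values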

Third: as image data is highly redundant, some form of delta coding or other image-related compression should save some space. However, it will not gain you a terribly large amount if you demand lossless compression, as the image noise is inherently incompressible. Also, it will not help you deal with the pseudo-random hash in the less-significant bits of your doubles, as explained above.

comingstorm
  • Very perceptive. I am already "palettizing" the data and compressing the resulting array using techniques optimized for 16-bit grayscale (actually multi-planar for A/D > 16-bit). The resulting temperature lookup is the part I need to compress more efficiently. I have also tested a variety of approaches to address the "chunking" issue you note, with some success (e.g. compressing the nth byte of each chunk together). The 8-level tree idea sounds interesting, but will take some work. Delta coding with a good predictor function might work well. – David Taylor Feb 11 '10 at 02:40
  • There is a more efficient generalization of palettization called Vector Quantization. See here for example: http://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php This is very heavy on processing time for compression, but super lightweight on decompression. Perhaps, though, this is what you are doing already. – 3yE Mar 28 '10 at 22:58
  • Thanks for the pointer on Vector Quantization, a very interesting approach. Learn something new every day! – David Taylor Mar 29 '10 at 12:58

All the encoders you list are byte oriented and are thrown off by a few properties of doubles. For one, the layout: the sign bit plus 11-bit exponent (12 bits in total) doesn't line up with byte boundaries. For another, there's the noisiness of your input. The first part is easy to deal with in a multitude of ways; the second will limit the effectiveness of any lossless compression you throw at it. I think even the best result will be less than amazing; I don't know your data, but I'd suspect you can count on a mere 25% saving, more or less.

Off the top of my head, and perhaps useless because you have already thought of everything on this list...

  1. Treat the stream as 64-bit integers and delta-encode adjacent values. If you have runs of values with the same exponent, it will effectively zero it out, as well as possibly some high mantissa bits. There will be overflows, but the data still needs only 64 bits and the operation can be reversed (a sketch combining several of these steps follows the list).

  2. At this stage you can optionally try some crude integer prediction, and save differences.

  3. If you have followed the previous suggestion, almost half your values will start with 000... and almost half with FFF... To eliminate that, rotate the value left (ROL) by 1 bit so the sign lands in the LSB, and if that LSB is 1, XOR the upper 63 bits with all Fs (the usual zigzag mapping). The reverse: if the LSB is 1, XOR the upper 63 bits with all Fs again, then ROR by 1 bit.

On second thought, simply XORing predictions with the true values may be better than taking differences, because then you don't need step 3 at all.

  4. You can try reordering the bytes to group bytes of the same significance together: first all the most significant bytes, then the next, and so on. At the very least you should get something like a massive run of zeroes, with at most a few bits of noise, up front.

  5. Run it through a generic compressor, or first RLE on the run of zeroes and then an entropy coder, such as Huffman or, better, the range coder from 7-Zip/LZMA.
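
A minimal sketch combining steps 1, 3, 4 and 5 (Python, with helper names of my own; step 2's prediction and the matching decoder are left out):

    import struct
    import zlib

    MASK64 = (1 << 64) - 1

    def encode(values):
        # 1. Reinterpret the doubles as 64-bit integers and delta-encode.
        ints = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
        deltas = [ints[0]] + [(b - a) & MASK64 for a, b in zip(ints, ints[1:])]

        # 3. Zigzag: move the sign to the LSB so small +/- deltas stay small.
        zz = [((d << 1) & MASK64) ^ (MASK64 if d >> 63 else 0) for d in deltas]

        # 4. Regroup bytes by significance, most significant plane first.
        packed = [struct.pack("<Q", z) for z in zz]
        planes = bytes(p[pos] for pos in range(7, -1, -1) for p in packed)

        # 5. Hand the reordered stream to a generic byte-oriented compressor.
        return zlib.compress(planes, 9)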

There is one good thing about your data: it is monotonic. There is a bad thing about it too: it is simply too small a set. How much do you want to save, mere kilobytes? What for? Compression effectiveness will suffer a lot if adjacent values often differ in exponent.

If you are processing a large number of these data sets, you should consider using their similarity to compress them together better -- perhaps interleave them at some stage. If you can live with some loss, zeroing out a few of the least significant bytes might be a good idea -- perhaps on both the source data and the prediction, so that you don't reintroduce noise there.

3yE
  • Your observations are on the mark. I have experimented with all of the methods you have suggested. The approach that seems to work best is a custom predictor algorithm and encoding model that uses XOR as per your suggestion. The reason compression is important is that many such images are being stored. The floating point compression is actually but a small part of a larger strategy which uses palettization and JPEG-LS 16-bit grayscale compression. Currently we are able to achieve around 65% compression, which is quite good IMO. Thx – David Taylor Mar 29 '10 at 12:56

If you want high-compression archival storage, "High Throughput Compression of Double-Precision Floating-Point Data" by Burtscher & Patanaworabhan or "Fast and Efficient Compression of Floating-Point Data" by Lindstrom & Isenberg may be helpful to you.

If you want faster dynamic access at the expense of a lower compression rate, then a 1D lifting wavelet may be appropriate. You can quantize the data to smaller integers by specifying the number of digits to keep. Then use delta encoding with a predictive model, followed by a Haar (or more expensive) wavelet transform and arithmetic encoding of the coefficients greater than a specified value.
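
A rough sketch of the quantize-then-lift front end (Python; quantize and haar_lift are names of my own, and the arithmetic coding stage is omitted):

    def quantize(values, digits=5):
        # Keep `digits` decimal places: lossy, but the error is bounded
        # by half of 10**-digits.
        scale = 10 ** digits
        return [round(v * scale) for v in values], scale

    def haar_lift(x):
        # One level of the reversible integer Haar (S) transform:
        # detail = odd - even, smooth = even + detail // 2.
        # Recurse on `smooth` for more levels; entropy-code `detail`.
        if len(x) % 2:
            x = x + x[-1:]                 # pad to even length
        detail = [x[2 * i + 1] - x[2 * i] for i in range(len(x) // 2)]
        smooth = [x[2 * i] + (d >> 1) for i, d in enumerate(detail)]
        return smooth, detail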

hope it helps

You can get Lindstrom's ZFP algorithm here: https://github.com/LLNL/zfp

James Fremen

Compression algorithms live on repetitions and regularities, and floating-point numbers don't do well at that.

The first question is whether you can use single-precision floating-point values, which will give you 50% compression immediately. Few thermometers are accurate to seven digits, and the exponent will represent temperatures considerably below anything I've been told you can actually get.

If not, can you filter your temperatures, rounding them off to the equivalent of N digits (more likely N/0.301 bits, since each decimal digit carries about 3.3 bits)? That may introduce enough regularities to be useful.
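
One way to do that filtering is to clear low-order mantissa bits rather than round decimal strings; a hedged sketch (the helper name and the digit-to-bit conversion are my own):

    import math
    import struct

    def truncate_mantissa(value, digits=5):
        # Keep roughly `digits` significant decimal digits, i.e. about
        # digits / 0.301 mantissa bits, and zero out the rest so the low
        # bytes of nearby values become identical (and compressible).
        keep = math.ceil(digits / 0.301)          # ~17 bits for 5 digits
        drop = max(0, 52 - keep)                  # IEEE 754 double: 52 mantissa bits
        bits = struct.unpack("<Q", struct.pack("<d", value))[0]
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
        return struct.unpack("<d", struct.pack("<Q", bits))[0]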

If you really have to store 64 bits of information for each temperature reading, and all the bits are significant and not predictable from the other bits, then you effectively can't compress it.

David Thornley
  • Single precision floats do not have sufficient precision. The data is related to infrared imaging with device sensitivity on the order of 0.08 Celsius or better. While the floating point values are non-repetitive, there is significant repetition at the byte level which the compression algorithms we have tried are unable to take advantage of due to their design. Based on some Google searches, this is a known problem with compression of scientific data. – David Taylor Feb 10 '10 at 17:55
  • @David -- Can you explain why you think single precision floats are not adequate ? With a span of 100 degrees and a resolution of even 0.01 degrees a single precision float should have more than enough precision/resolution. – Paul R Feb 10 '10 at 18:31
  • Um, yeah. A single precision float will get you six significant digits easy. For a range of, say, 0-1000K, that will get you resolution better than 0.001 kelvin, which is a whole lot better than your imager sensitivity. Are you trying to preserve the random noise or something? – David Thornley Feb 10 '10 at 21:21
  • Technically yes, but the end customer expects the "same" value as their original image. Data + Noise in -> gospel out ;) The loss in precision is readily apparent to the end user, though the meaningfulness of the "lost data" is insignificant. The perception of data loss outweighs the rational analysis. Such is life. – David Taylor Feb 11 '10 at 02:46

Can you do a delta between adjacent values?
Is there a limit to how much a value can change between measurements? Is it acceptable to limit this change to some maximum rate (at the cost of introducing some smoothing)?

There is obviously a limit to the real accuracy of the values from the thermal sensor; do you need to store 64 bits of precision, or are you better off storing an integer number of, say, 0.01-Kelvin units?

If you can live with some more error and the increase is relatively smooth you might be better just fitting a function to the data and storing only a few terms of the function.

EDIT:
Take a look at a typical data set and look at the range of differences between adjacent values. Then look at what accuracy you need to represent this.

e.g. if you have at most a 1-degree difference between readings, you could store changes in units of 1/256 of a degree in a byte. If you need to store a bigger range or more accuracy, use a short divided by some factor.
So the next reading would be: next_reading = last_reading + (float)increment / 256.0
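
A quick sketch of that scheme (helper names are mine; it is lossy by construction, with error bounded by half a step, and the encoder must accumulate the same reconstructed values the decoder will see, to avoid drift):

    def delta_encode(readings, step=1.0 / 256):
        # First reading stored in full; every later reading as a signed
        # integer count of `step`-sized increments (fits a byte while the
        # change between readings stays under ~0.5 degrees).
        encoded = [readings[0]]
        last = readings[0]
        for r in readings[1:]:
            inc = int(round((r - last) / step))
            encoded.append(inc)
            last += inc * step    # track the *reconstructed* value, not the original
        return encoded

    def delta_decode(encoded, step=1.0 / 256):
        out = [encoded[0]]
        for inc in encoded[1:]:
            out.append(out[-1] + inc * step)
        return out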

Martin Beckett
  • We have considered using Delta encoding of some sort, but have not yet devised an algorithm. The values relate to an infrared image and thus do not have any significant discontinuities. – David Taylor Feb 10 '10 at 17:45

You could think about re-coding your data with an entropy coder (Huffman, Shannon-Fano, Arithmetic Coding). But this will only provide good results if you have many repetitions of the datapoints and if you know which symbols will appear with which probability.
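
As a quick check of whether that would pay off, one can measure the order-0 byte entropy of the packed stream before building a coder; a small sketch (assuming the symbols are individual bytes):

    import math
    import struct
    from collections import Counter

    def byte_entropy(values):
        # Order-0 Shannon entropy, in bits per byte, of the packed doubles.
        # A result close to 8 means a plain entropy coder alone won't gain much.
        raw = b"".join(struct.pack("<d", v) for v in values)
        counts = Counter(raw)
        n = len(raw)
        return -sum(c / n * math.log2(c / n) for c in counts.values())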

Sebastian

Recently, I did an investigation into lossless compression where I tried breaking the floating point values out into their constituent parts -- sign bit, exponent, and mantissa (further subdivided into 3 parts) -- then compressing each of these separately using conventional means, including a differencing predictor and either Deflate or Huffman coding. The reduction in size was modest, only about 50 percent, but at least it was lossless. It does show that even in floating point data there are places where you can find redundancy.
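
A minimal sketch of that field splitting (in Python here, rather than the Java mentioned below, with Deflate standing in for the coders named above; the further mantissa subdivision and the predictor are omitted):

    import struct
    import zlib

    def split_and_compress(values):
        # Separate the IEEE 754 fields into their own streams and compress
        # each one; the redundancy hides in the exponents and high mantissa bytes.
        signs, exponents, mantissas = bytearray(), bytearray(), bytearray()
        for v in values:
            bits = struct.unpack("<Q", struct.pack("<d", v))[0]
            signs.append(bits >> 63)                                   # 1 sign bit
            exponents.extend(struct.pack("<H", (bits >> 52) & 0x7FF))  # 11 exponent bits
            mantissas.extend(struct.pack("<Q", bits & ((1 << 52) - 1))[:7])  # 52 mantissa bits
        return [zlib.compress(bytes(stream), 9)
                for stream in (signs, exponents, mantissas)]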

I've posted a write-up at Lossless Compression for Floating Point Data. There's Java source code associated with the write-up that may be useful if you're still interested in the problem.

Gary Lucas