
Summary: An industrial thermometer samples the temperature of a piece of process equipment. For a few months, the samples are simply stored in an SQL database. Are there any well-known ways to compress the temperature curve so that a much longer history could be stored efficiently (say, for audit purposes)?

More details: Actually, there are many more thermometers, and possibly other sensors related to the process. There are also well-known time intervals where the curve belongs to a batch processed on the machine. The temperature curves should be added to the batch documentation.

My idea was that the temperature is a smooth function that could be interpolated somehow -- say, the way sound is compressed in the MP3 format. The compression need not be lossless. However, it must be possible to reconstruct the temperature curve (not necessarily the identical sample values, nor the identical sampling interval) -- say, to be able to plot the curve or to tell what the temperature was at a certain time.

The raw sample values from the SQL table would be processed, the compressed version would be stored elsewhere (possibly also in the SQL database, as a blob), and the raw samples could later be deleted to save database space.

Is there any well-known and widely used approach to the problem?

pepr
  • Most (all?) SQL databases support transparent compression of data, ideally even columnar storage so that data correlations can be exploited more efficiently. But how much data are we talking about here? Also, for some use cases you don't need to store raw historical data; aggregations such as means and percentiles could be enough. And if historical data doesn't need to be readily available, you could just dump it into a compressed file and upload it to, for example, Amazon S3. – NikoNyrh Jan 04 '17 at 22:37

1 Answer


A simple approach would be to code the temperature into one or two bytes, depending on the range and precision you need, and then to write the first temperature to your output, followed by the difference between successive temperatures for all the rest. For two-byte temperatures you can restrict the range somewhat and write one or two bytes per difference using a variable-length integer. E.g. if the high bit of the first byte is set, then the next byte contains 8 more bits of the difference, allowing for 15 bits of difference in total. Most of the time it will be one byte, based on your description.
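A sketch of that encoder and decoder (hypothetical code, assuming the samples are already quantized to signed 16-bit integers and that successive differences fit in 15 bits after a zigzag mapping):

```python
def encode(samples):
    """First sample as two raw bytes, then each difference as a
    one- or two-byte variable-length integer."""
    out = bytearray(samples[0].to_bytes(2, "big", signed=True))
    prev = samples[0]
    for s in samples[1:]:
        diff = s - prev
        prev = s
        z = (diff << 1) ^ (diff >> 15)      # zigzag: small |diff| -> small code
        if z < 0x80:
            out.append(z)                   # high bit clear: single byte
        else:
            out.append(0x80 | (z & 0x7F))   # high bit set: another byte follows
            out.append((z >> 7) & 0xFF)     # 8 more bits -> 15 bits total
    return bytes(out)

def decode(data):
    value = int.from_bytes(data[:2], "big", signed=True)
    samples = [value]
    i = 2
    while i < len(data):
        z = data[i] & 0x7F
        if data[i] & 0x80:                  # continuation bit set?
            i += 1
            z |= data[i] << 7
        i += 1
        value += (z >> 1) ^ -(z & 1)        # undo zigzag, add the difference
        samples.append(value)
    return samples
```

Decoding with `decode(encode(samples))` reproduces the integer samples exactly; any loss comes only from the earlier quantization step.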

Then take that stream and feed it to a standard lossless compressor, e.g. zlib.
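As a toy demonstration (the signal here is made up): the difference stream of a slowly varying signal is highly repetitive, so DEFLATE shrinks it dramatically:

```python
import zlib

# Slowly varying toy signal: the deltas repeat with period 7, so the
# byte stream is extremely redundant and zlib compresses it very well.
samples = [1000 + (i % 7) for i in range(5000)]
deltas = bytes((samples[i] - samples[i - 1]) & 0xFF
               for i in range(1, len(samples)))
packed = zlib.compress(deltas, 9)
assert zlib.decompress(packed) == deltas    # this stage is lossless
```

On data like this the compressed stream is a tiny fraction of the delta stream; real sensor noise will reduce the ratio, but small, clustered differences still compress far better than raw samples.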

Any lossiness should be introduced at the sampling step, encoding only the number of bits you really need to encode the required range and precision. The rest of the process should then be lossless to avoid systematic drift in the decompressed values.
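For instance, quantizing a float reading into a fixed-point code over a known range (the range and bit width below are hypothetical) confines all the loss to this one step:

```python
def quantize(value, lo, hi, bits=16):
    """Map a reading in [lo, hi] to an integer code.  This is the only
    lossy step; worst-case error is half a quantization step."""
    return round((value - lo) / (hi - lo) * ((1 << bits) - 1))

def dequantize(code, lo, hi, bits=16):
    """Recover an approximate reading from the integer code."""
    return lo + code * (hi - lo) / ((1 << bits) - 1)
```

With an assumed -50..150 °C range and 16 bits, one step is about 0.003 °C, far below a typical sensor's accuracy.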

Subtracting successive values is the simplest predictor. In that case the prediction of the next value is the value before it. It may also be the most effective, depending on the noisiness of your data. If your data is really smooth, then you could try a higher-order predictor to see if you get better performance. E.g. a predictor for the next point using the last two points is 2a - b, where a is the previous point and b is the point before that, or using the last three points 3a - 3b + c, where c is the point before b. (These assume equal time steps between each.)
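To compare the predictors empirically, you can compute the residual stream for each order and keep whichever compresses smallest (a rough sketch; the residuals are packed naively as 32-bit integers here):

```python
import struct
import zlib

def residuals(samples, order):
    """Prediction residuals for order 0 (previous value),
    order 1 (2a - b), and order 2 (3a - 3b + c)."""
    out = list(samples[:order + 1])         # seed values kept verbatim
    for i in range(order + 1, len(samples)):
        if order == 0:
            pred = samples[i - 1]
        elif order == 1:
            pred = 2 * samples[i - 1] - samples[i - 2]
        else:
            pred = 3 * samples[i - 1] - 3 * samples[i - 2] + samples[i - 3]
        out.append(samples[i] - pred)
    return out

def compressed_size(samples, order):
    """Size after zlib of the naively packed residual stream."""
    res = residuals(samples, order)
    return len(zlib.compress(struct.pack(f"<{len(res)}i", *res), 9))
```

Running `min(range(3), key=lambda k: compressed_size(samples, k))` on a representative slice of your data picks the order that actually wins for your signal.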

Mark Adler
  • This is a good idea! The signal (a float restricted to a min..max interval) can be quantized to an integer. 64 K steps between min and max gives fairly good precision for many industrial signals like temperature, pressure, etc. Then delta encoding ensures small values when the signal does not change too quickly, or when very frequent sampling is required. I guess the key will be to choose a suitable predictor function. Is there any theory behind how to choose it correctly? (Possibly after some data analysis?) – pepr Jan 05 '17 at 14:59
  • "So Long And Thanks for All the ... zlib" :) – pepr Jan 05 '17 at 15:01
  • Just try the predictors in my answer with your data, and see which gives the smallest output after compressing. – Mark Adler Jan 05 '17 at 16:18
  • If you are actually using a database for storage, realize that if you use a "previous-value-dependent predictor algorithm", then in order to understand the actual value of record N, you must query ALL the records 1..N-1 to calculate what record N's value means. This defeats the purpose of using a database for lookups. Analog signals have a limited range, so pick a delta at the middle of the signal and break it into 256 (SIGNED BYTE) or 65536 (SIGNED WORD) "ticks" for the desired resolution. You can then calculate engineering units from the middle/tick resolution values. – franji1 Jan 05 '17 at 18:22
  • @franji1 (actually, you mean UNSIGNED :) Storing the "raw" samples in the database is a result of the way the data is obtained; I cannot change that. What I want is rather some post-processing of some of the data where it makes sense (say, the temperature over time). There is no need for lookup in my case. I can use a database cursor to get the values in sequence. (No problem here. Also, the later reconstruction of the samples need not be extremely fast.) – pepr Jan 05 '17 at 19:32
  • @MarkAdler: I see. The 2a - b describes linear extrapolation based on the two previous samples. I was also able to find why 3a - 3b + c has this form -- it assumes a constant difference between consecutive deltas (sorry for my English -- it would be a constant second derivative if the signal were analogue). I will write another question on when to quantize -- before prediction or after prediction. It seems that "after" may lead to better results, but there may also be some problems. I am getting first results, but I need to study more :) – pepr Feb 20 '17 at 22:16