
I have to write a tool that manages very large data sets (well, large for an ordinary workstation). Basically, I need something that works the opposite way from the JPEG format: the dataset must stay intact on disk, where it can be arbitrarily large, but it should be lossily compressed when read into memory, with only the sub-part in use at any given time decompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives), but so far it's not clear whether I can use it for what I need to do. Can anyone point me in the right direction?

Thank you.

Luc Touraille
  • Are you basically looking for an implementation of the JPEG codec? What language are you working in? – Oliver Charlesworth Jan 13 '12 at 22:05
  • What kind of data are involved? Image compression is very different than text, for example. – ergosys Jan 13 '12 at 22:07
  • This sounds like it will be better at the statistics SE site. Data sampling, transformation, summary, analysis, etc. are all in the statistical domain. If the data is of a particular nature, e.g. images, audio, text, etc., then more specialized methods would apply. – Iterator Jan 13 '12 at 22:30
  • What exactly does "lossy" mean in this context. What kind of data do you think can be lost? – svick Jan 13 '12 at 22:52
  • The compression/decompression needs to be as fast as possible, so I'm going to use C++. Data is going to be big matrices (> 5 GB) and arrays. – user1148583 Jan 14 '12 at 01:16
  • I have been doing a little more research, and it seems that Principal Component Analysis is a way to "lossily" compress data. If I'm compressing a matrix at loading time, I'm not expecting to get the same matrix back, but one that behaves similarly within a certain threshold. As an example, I have found this paper that compresses matrices to drive computer animation: http://naml.us/~irving/papers/seo2011_compression.pdf – user1148583 Jan 14 '12 at 01:29
  • Is your "lossy" concern only for "fast" access to big data? – Osman Turan Jan 14 '12 at 10:45
  • No, it needs to be lossy because I'm already compressing the data with a non-lossy algorithm and it is not enough. – user1148583 Jan 15 '12 at 05:15
  • You really need to tell us more about the data and how it's accessed before anyone can provide a useful answer. – Nick Johnson Jan 16 '12 at 02:26

1 Answer


Given the nature of your data, it seems you are handling some kind of raw samples. So the easiest and most generic "lossy" technique is to drop the lower bits, reducing precision to the level you want.

Note that you really do need to "drop the lower bits", which is quite different from "round to the nearest power of 10". Computers work in base 2, and you want all your lower bits to be zero for compression to perform as well as possible. This method assumes that the selected compression algorithm will take advantage of the predictable pattern of zero bits.
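As a sketch, zeroing out the low mantissa bits of a float can be done with a bit mask (the function name and the choice of 8 bits below are just illustrative; pick the bit count from the precision you can afford to lose):

```cpp
#include <cstdint>
#include <cstring>

// Zero out the lowest `n` mantissa bits of a 32-bit float.
// The value changes only slightly, but the runs of trailing zero
// bits make the buffer far more compressible by a generic codec.
float drop_low_bits(float value, int n) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // safe type punning
    bits &= ~((uint32_t{1} << n) - 1);         // clear the low n bits
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
```

Applied over a whole buffer before compression, this costs one pass but typically improves the ratio of any entropy-based codec that follows.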

Another method, more complex and more specific, is to convert your values into indices into a table. The advantage is that you can "target" precision where you need it. The obvious drawback is that the table will be specific to one distribution pattern.
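A minimal sketch of the table approach, with a purely hypothetical codebook (in practice you would build the table from the actual distribution of your data, e.g. by clustering, placing more entries where precision matters most):

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical 8-entry codebook: dense near zero, sparse far away.
constexpr std::array<float, 8> kCodebook = {
    -10.f, -1.f, -0.1f, 0.f, 0.1f, 1.f, 10.f, 100.f};

// Map a value to the index of the nearest codebook entry:
// 3 bits of storage instead of 32 per value.
uint8_t quantize(float v) {
    uint8_t best = 0;
    float bestDist = std::fabs(v - kCodebook[0]);
    for (uint8_t i = 1; i < kCodebook.size(); ++i) {
        float d = std::fabs(v - kCodebook[i]);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

float dequantize(uint8_t idx) { return kCodebook[idx]; }
```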

On top of that, you can store not the value itself but its delta from the preceding value, if there is any kind of relation between them. This helps compression too.
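Delta encoding is a few lines; a sketch for integer data (slowly varying sequences turn into many small, repetitive values that compress much better):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// In-place delta encoding: each element (after the first) is
// replaced by its difference from the previous original element.
void delta_encode(std::vector<int32_t>& v) {
    for (std::size_t i = v.size(); i-- > 1; )
        v[i] -= v[i - 1];
}

// Exact inverse: prefix-sum to restore the original values.
void delta_decode(std::vector<int32_t>& v) {
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
}
```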

For the data to compress well, you will need to group it into packets of an appropriate size, such as 64 KB. No compression algorithm will give you suitable results on a single field. This, in turn, means that each time you want to access a field, you have to decompress the whole packet, so tune the packet size to how you intend to use the data. Sequential access is easier to deal with in such circumstances.
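The bookkeeping for packet-granular access is just index arithmetic; a sketch with an assumed 64 KB packet size (the constants and function names are illustrative):

```cpp
#include <cstddef>

// Hypothetical fixed-size packet layout: a random access only
// needs to locate and decompress one packet, not the whole set.
constexpr std::size_t kPacketBytes    = 64 * 1024;  // 64 KB packets
constexpr std::size_t kElemsPerPacket = kPacketBytes / sizeof(float);

std::size_t packet_of(std::size_t elementIndex) {
    return elementIndex / kElemsPerPacket;
}
std::size_t offset_in_packet(std::size_t elementIndex) {
    return elementIndex % kElemsPerPacket;
}
```

A small cache of recently decompressed packets on top of this keeps sequential scans cheap, since consecutive accesses land in the same packet.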

Regarding the compression algorithm: since the data are going to be "live", you need something very fast, so that accessing them has a very small latency impact.

There are several open-source alternatives out there for that use. For easier license management, I would recommend a BSD-licensed one. Since you use C++, the following look suitable: http://code.google.com/p/snappy/ and http://code.google.com/p/lz4/

Cyan