23

I want to write a storage backend to store larger chunks of data. The data can be anything, but it is mainly binary files (images, pdfs, jar files) or text files (xml, jsp, js, html, java...). I found that most of the data is already compressed. If everything is compressed, about 15% of disk space can be saved.

I am looking for the most efficient algorithm that can predict with high probability that a chunk of data (let's say 128 KB) can be compressed or not (lossless compression), without having to look at all the data if possible.

The compression algorithm will be either LZF, Deflate, or something similar (maybe Google Snappy). So predicting if data is compressible should be much faster than compressing the data itself, and use less memory.

Algorithms I already know about:

  • Try to compress a subset of the data, let's say 128 bytes (this is a bit slow)

  • Calculate the sum of 128 bytes, and if it's within a certain range then it's likely not compressible (within 10% of 128 * 127, roughly the expected sum for uniformly random bytes) (this is fast, and relatively good, but I'm looking for something more reliable, because the algorithm really only looks at the topmost bits of each byte)

  • Look at the file headers (relatively reliable, but feels like cheating)

I guess the general idea is that I need an algorithm that can quickly calculate if the probability of each bit in a list of bytes is roughly 0.5.
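
For illustration, the byte-sum check from the list above looks roughly like this (a minimal sketch; the method name and the 10% tolerance are just placeholders, not the code I actually use):

static boolean looksRandom(byte[] data, int offset, int count) {
    // For random (incompressible) data, the average unsigned byte value
    // is about 127.5, so the sum of 'count' bytes should be close to
    // count * 127.5. The 10% tolerance is arbitrary.
    long sum = 0;
    for (int i = offset; i < offset + count; i++) {
        sum += data[i] & 0xff;
    }
    long expected = (long) count * 255 / 2;
    return Math.abs(sum - expected) < expected / 10;
}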

Update

I have implemented 'ASCII checking', 'entropy calculation', and 'simplified compression', and all give good results. I want to refine the algorithms, and now my idea is to not only predict if data can be compressed, but also how much it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers... I will accept the answer that gave the best results.

Additional answers (new ideas) are still welcome! If possible, with source code or links :-)

Update 2

A similar method is now implemented in Linux.

Thomas Mueller
  • You could try a statistical approach (which you've apparently already considered) or make some estimates beforehand given the file type. I'd go for the second option and improve on that. – James P. Aug 11 '11 at 13:51
  • Well, yes, but which statistical approach exactly? – Thomas Mueller Aug 11 '11 at 13:55
  • Where do you get your 128 bytes? If that portion has more varied data (header information / magic numbers / other) than other locations (a large expanse of pixels all the same color / 3000 paragraph breaks / a group of children with the same age), then your prediction might say no (the 128 bytes I saw were rich and full of content and not compressible), when the bulk of the data is amenable. Just curious :) – Atreys Aug 11 '11 at 13:58
  • There's various approaches possible in pure statistics. One obvious one is selecting random elements. – James P. Aug 11 '11 at 14:00
  • I found that looking at the first 128 bytes is enough to get a good prediction if a block is compressible. But it's really just an example. – Thomas Mueller Aug 11 '11 at 14:02
  • If parsing a mere 128 bytes of data is too slow for you, I can't imagine a reliable method that would work fast enough. – tskuzzy Aug 11 '11 at 14:05
  • If you are mostly dealing with complete files then I would go with the file headers, maybe even only the first 4 bytes, to identify known compressed file formats. The Linux file command uses a very extensive database of magic patterns from which you could extract the needed information. – Jörn Horstmann Aug 11 '11 at 14:46
  • @Jörn Unfortunately, it's not always complete files. Sometimes it's just chunks of files (large files are split into chunks). – Thomas Mueller Aug 11 '11 at 15:15
  • @Thomas Not sure about the first 128 bytes. Some formats have a compressible header there. E.g. a JAR/zip file has a list of file names, which is mostly plain text, but then it has compressed content. Maybe it's worth sampling several small blocks across the whole block. – kan Aug 11 '11 at 16:18
  • @kan You are right. Instead of smaller blocks, I guess using an offset of (let's say) 1 KB will help. Picking every xth byte might work as well, but isn't that cache-efficient. – Thomas Mueller Aug 11 '11 at 17:06

9 Answers

8

Calculate the entropy of the data. If it has high entropy (~1.0), it is not likely to compress further. If it has low entropy (~0.0), that means there isn't a lot of "information" in it, and it can be compressed further.

It provides a theoretical measure of how compressed a piece of data can get.
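
For example, a simple order-0 (per byte value) estimate can be computed like this (a sketch, not tuned for speed; as the comments below note, it only captures the byte-value distribution, not repeated sequences):

// Order-0 (per-byte) entropy, normalized to 0..1
// (1.0 means all 256 byte values are equally likely).
static double normalizedEntropy(byte[] data, int len) {
    int[] counts = new int[256];
    for (int i = 0; i < len; i++) {
        counts[data[i] & 0xff]++;
    }
    double entropy = 0;
    for (int c : counts) {
        if (c > 0) {
            double p = (double) c / len;
            entropy -= p * Math.log(p) / Math.log(2);
        }
    }
    return entropy / 8; // 8 bits per byte is the maximum
}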

tskuzzy
  • Yes, and how to do that efficiently? – Thomas Mueller Aug 11 '11 at 13:58
  • What do you mean? Just go through the file and calculate it. It's not an expensive operation. – tskuzzy Aug 11 '11 at 13:59
  • The entropy is only a good measure for some simple compression techniques, e.g. using plain Huffman coding. Commonly used compression formats (gzip, bzip, lzma) use much more complex algorithms, so the entropy alone is not usable to determine if the data can be compressed. – jarnbjo Aug 11 '11 at 14:00
  • @jarnbjo: It's a measure of what the BEST compression technique can achieve. I don't understand how that's not enough. No matter how complex the algorithm, it can't do better than the entropy of the data. – tskuzzy Aug 11 '11 at 14:03
  • @jarnbjo What techniques are you talking about? Can't imagine a general purpose encoder that could beat the entropy of data - I mean how would that work? – Voo Aug 11 '11 at 14:04
  • @tskuzzy: Sure, to be more exact, it depends on your (the compressor's) exact understanding of entropy or how you define the input alphabet for the entropy calculation. Simple byte counting is not enough. – jarnbjo Aug 11 '11 at 14:13
  • @jarnbjo: Entropy will tell you if your data **can** be further compressed (which is what the OP asked). It is true that it does not tell you whether or not your specified compression algorithm will be able to achieve that compression. – tskuzzy Aug 11 '11 at 14:15
  • @tskuzzy: Please enlighten me how to perform this magic and generic entropy calculation. What is e.g. the entropy of the byte sequence 0 .. 255 repeated 1000 times (according to your understanding of entropy)? – jarnbjo Aug 11 '11 at 14:25
  • @tskuzzy sounds like a good idea, however processing 1 byte at a time seems to require an array of 256 counters, which sounds a bit slow - I will try that, and try if processing 4 or 2 bits at a time is faster (with 16 or just 4 counters, in registers) – Thomas Mueller Aug 11 '11 at 14:25
  • Have you given some consideration to the other approach suggested? In other words, take a sample of files of each type and see what compression average you can get. – James P. Aug 11 '11 at 14:51
  • jarnbjo is right. You seem to imply that calculating the (real) entropy of the data is straightforward, but it's not; you need to make assumptions, e.g. that the bytes are independent. But a file could have all its bytes equiprobable and still have high redundancy (low entropy). And, precisely, compressors like gzip exploit that kind of redundancy, and that's difficult to measure (as a probabilistic model). – leonbloy Nov 11 '13 at 02:05
8

From my experience, almost all of the formats that can effectively be compressed are non-binary. So checking whether about 70-80% of the characters are within the [0-127] range should do the trick.

If you want to do it "properly" (even though I really can't see a reason to do that), you either have to run (parts of) your compression algorithm on the data or calculate the entropy, as tskuzzy already proposed.
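
A minimal sketch of that check (the 75% threshold is just an example value in the 70-80% range):

// Count how many bytes are plain 7-bit ASCII; if most of them are,
// the block is very likely text and therefore compressible.
static boolean isMostlyAscii(byte[] data, int len) {
    int ascii = 0;
    for (int i = 0; i < len; i++) {
        if (data[i] >= 0) { // a signed byte >= 0 means value 0..127
            ascii++;
        }
    }
    return ascii * 100 >= len * 75;
}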

Chronial
  • This sounds like what I was thinking about so far. Calculating the sum of 128 bytes is an easy way to check if most characters are within [0-127], this I already have. Using a simplified compression algorithm (one without output) is also a good idea, and is probably a bit better than calculating the entropy (which is something I didn't think about so far). – Thomas Mueller Aug 11 '11 at 15:12
  • I have implemented 'ASCII checking', 'entropy calculation', and 'simplified compression' (see my answer below), and all give good results. I want to refine the algorithms, and now my idea is to not only predict _if_ data can be compressed, but also _how much_ it can be compressed. Possibly using a combination of algorithms. Now if I could only accept multiple answers... I will accept the answer that gave the best results. – Thomas Mueller Aug 13 '11 at 09:54
  • I will use the partial compression algorithm (see answer below). I found the entropy calculation is also good, but not quite as good, and sometimes a bit slower. – Thomas Mueller Aug 17 '11 at 19:21
8

I implemented a few methods to test if data is compressible.

Simplified Compression

This basically checks for duplicate byte pairs:

static boolean isCompressible(byte[] data, int len) {
    int result = 0;
    // check in blocks of 256 bytes, 
    // and sum up how compressible each block is
    for (int start = 0; start < len; start += 256) {
        result += matches(data, start, Math.min(start + 255, len));
    }
    // the result is proportional to the number of 
    // bytes that can be saved
    // if we can save many bytes, then it is compressible
    return ((len - result) * 777) < len * 100;
}

static int matches(byte[] data, int i, int end) {
    // bitArray is a bloom filter of seen byte pairs
    // match counts duplicate byte pairs
    // last is the last seen byte
    int bitArray = 0, match = 0, last = 0;
    if (i < 0 || end > data.length) {
        // this check may allow the JVM to avoid
        // array bound checks in the following loop
        throw new ArrayIndexOutOfBoundsException();
    }
    for (; i < end; i++) {
        int x = data[i];
        // the bloom filter bit to set
        int bit = 1 << ((last ^ x) & 31);
        // if it was already set, increment match
        // (without using a branch, as branches are slow)
        match -= (-(bitArray & bit)) >> 31;
        bitArray |= bit;
        last = x;
    }
    return match;
}

On my (limited) set of test data, this algorithm is quite accurate. It is about 5 times faster than actually compressing the data when the data is not compressible. For trivial data (all zeroes), however, it is about half as fast.

Partial Entropy

This algorithm estimates the entropy of the high nibbles. I wanted to avoid using too many buckets, because they have to be zeroed out each time (which is slow if the blocks to check are small). 63 - numberOfLeadingZeros is the base-2 logarithm (I wanted to avoid using floating point numbers). Depending on the data, it is faster or slower than the algorithm above (not sure why). The result isn't quite as accurate as the algorithm above, possibly because it uses only 16 buckets, and only integer arithmetic.

static boolean isCompressible(byte[] data, int len) {
    // the number of bytes with 
    // high nibble 0, 1,.., 15
    int[] sum = new int[16];
    for (int i = 0; i < len; i++) {
        int x = (data[i] & 255) >> 4;
        sum[x]++;
    }
    // see wikipedia to understand this formula :-)
    int r = 0;
    for (int x : sum) {
        long v = ((long) x << 32) / len;
        r += 63 - Long.numberOfLeadingZeros(v + 1);
    }
    return len * r < 438 * len;
}
Thomas Mueller
  • Wow, I'm not afraid of bit-twiddling but that screams for some comments or named constants, especially the "2777" part. Also, won't last*2777 overflow? – Jörn Horstmann Aug 12 '11 at 12:39
  • This is just the initial version... The final version will have comments. The code is similar to LZF and other compression algorithms. last*2777 is _supposed_ to overflow, as it's a hash function. – Thomas Mueller Aug 12 '11 at 14:44
  • thanks for the explanation, using a hash function of a group of bytes and counting repeated values is a nice idea. – Jörn Horstmann Aug 12 '11 at 17:14
  • I guess now you have a final version of these methods, can you update the answer? – Rui Marques Mar 14 '14 at 11:16
  • I don't actually use the code yet (in the [MVStore](http://h2database.com/html/mvstore.html)), but I have tried to comment the code. – Thomas Mueller Mar 16 '14 at 18:20
4

This problem is interesting on its own because, with zlib for example, compressing incompressible data takes much longer than compressing compressible data. So an unsuccessful compression attempt is especially expensive (for details see the links). Nice recent work in this area has been done by Harnik et al. from IBM.

Yes, the prefix method and the byte order-0 entropy (called entropy in the other posts) are good indicators. Other good ways to guess whether a file is compressible or not are (from the paper; a rough sketch of the first one follows below):

  • Core-set size – The character set that makes up most of the data
  • Symbol-pairs distribution indicator

Here is the FAST paper and the slides.
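
As a rough illustration of the core-set idea (my own sketch, not the exact definition from the paper): count how many distinct byte values are needed to cover most of the sample; a small core set points to a skewed distribution and therefore likely compressible data.

// Number of distinct byte values needed to cover 90% of the sample.
// A small result (e.g. well below 100) suggests compressible data.
// The 90% coverage and any threshold on the result are guesses.
static int coreSetSize(byte[] data, int len) {
    int[] counts = new int[256];
    for (int i = 0; i < len; i++) {
        counts[data[i] & 0xff]++;
    }
    java.util.Arrays.sort(counts); // ascending, most frequent values last
    long covered = 0;
    int size = 0;
    for (int i = 255; i >= 0 && covered * 10 < (long) len * 9; i--) {
        covered += counts[i];
        size++;
    }
    return size;
}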

dmeister
2

A faster and more accurate algorithm for estimating compressibility

  1. It is 2 to 4 times faster and gives a more accurate answer than judging by Shannon's entropy. It is based on the Huffman coding approach.
  2. The time complexity does not depend on the numerical values of the symbol frequencies, only on the number of unique symbols. Shannon's entropy computes log(frequency), so the larger the frequencies, the more time it takes to compute; in this approach, mathematical operations on the frequency values are avoided.
  3. For similar reasons, the precision is also higher, since the dependency on floating point operations is avoided: we only rely on additions and multiplications and on how the actual Huffman codes contribute to the total compressed size.
  4. The same algorithm can be extended to generate the actual Huffman codes in less time, without complex data structures like trees, heaps or priority queues. For our different requirements, we just reuse the same frequency array of symbols.

The following algorithm specifies how to calculate the compressibility of a file whose symbol frequency values are stored in the map array.

[Time comparison chart]

  int compressed_file_size_in_bits = 0, n = 256;
  /* We sort the map array in increasing order.
   * We will be simulating the Huffman coding algorithm.
   * Insertion sort is used as it's a small array of 256 symbols.
   */
  insertionSort(map, 256);

  for (j = 0; j < n; j++)
    if (map[j] != 0)
      break;

  for (i = j; i + 1 < n; i++) {
    j = i + 1;
    /* The following is an important step: as we keep building more
     * and more codes bottom-up, their contribution to the compressed size
     * is governed by the formula below. A pen-and-paper simulation is
     * recommended.
     */
    compressed_file_size_in_bits = compressed_file_size_in_bits + map[i] + map[j];

    /* The two smallest elements of the map get summed up and form a new
     * frequency value, which is placed at index i+1.
     */
    map[i + 1] = map[i] + map[j];
    // map [i+2-----] is already sorted. Just fix the first element.
    Adjust_first_element(map + i + 1, n - i - 1);
  }
  printf("Entropy per byte %f ", compressed_file_size_in_bits * (1.0) / file_len);

void insertionSort(long arr[], long n) {
  long i, key, j;
  for (i = 1; i < n; i++) {
    key = arr[i];
    j = i - 1;

    /* Move elements of arr[0..i-1], that are
    greater than key, to one position ahead
    of their current position */
    while (j >= 0 && arr[j] > key) {
      arr[j + 1] = arr[j];
      j = j - 1;
    }
    arr[j + 1] = key;
  }
}

// Assumes arr[i+1---] is already sorted. Just first
// element needs to be placed at appropriate place.
void Adjust_first_element(long arr[], long n) {
  long i, key, j = 1;
  key = arr[0];
  while (j < n && arr[j] < key) {
    arr[j - 1] = arr[j];
    j = j + 1;
  }
  arr[j - 1] = key;
}

Construction of codes using the above algorithm

Constructing the codes using the above algorithm is a string manipulation problem: we start with an empty code for each symbol. Then we follow the same algorithm as for the compressed size / compressibility calculation; additionally, we keep maintaining the history of how the codes evolve. After the iteration through the frequency array finishes, the final code, which contains the evolution of the different Huffman codes for each symbol, is stored in the top index of the codes array. At this point, a string parsing algorithm can parse this evolution and generate the individual codes per symbol. The whole process involves no trees, heaps or priority queues. Just one iteration through the frequency array (size 256 in most cases) generates the evolution of the codes as well as the final compressed size value.

/* Generate code for a map array of frequencies. The final code gets generated
 * at codes[r], which can be provided as input to a string parsing algorithm
 * to generate the code for each individual symbol.
 */
void generate_code(long map[], int l, int r) {
  int i, j, k, compressed_file_size_in_bits = 0;

  insertionSort(map + l, r - l);

  for (i = l; i + 1 <= r; i++) {
    j = i + 1;

    compressed_file_size_in_bits = compressed_file_size_in_bits + map[i] + map[j];
    char code[50] = "(";

    /* According to the algorithm, two different codes from two different
     * nodes are combined in a way that allows them to be separated again
     * by a string parsing algorithm. The left node code, codes[i], gets
     * appended with 0 and the right node code, codes[j], gets appended
     * with 1. The two codes are separated by a comma.
     */
    strcat(code, codes[i]);
    strcat(code, "0");
    strcat(code, ",");
    strcat(code, codes[j]);
    strcat(code, "1");
    strcat(code, ")");

    map[i + 1] = map[i] + map[j];

    strcpy(codes[i + 1], code);

    int n = r - l;
    /* Adjust_first_element now takes an additional 3rd argument.
     * This argument helps in adjusting the codes according to how
     * the map elements are adjusted.
     */
    Adjust_first_element(map + i + 1, n - i - 1, i + 1);
  }
}

void insertionSort(long arr[], long n) {
  long i, key, j;
  for (i = 1; i < n; i++) {
    key = arr[i];
    j = i - 1;

    /* Move elements of arr[0..i-1], that are
    greater than key, to one position ahead
    of their current position */
    while (j >= 0 && arr[j] > key) {
      arr[j + 1] = arr[j];
      j = j - 1;
    }
    arr[j + 1] = key;
  }
}

// Assumes arr[i+1---] is already sorted. Just first
// element needs to be placed at appropriate place.
void Adjust_first_element(long arr[], long n, int start) {
  long i, key, j = 1;
  char temp_arr[250];
  key = arr[0];
  /* How map elements will change position, codes[] element will follow
   * same path
   */
  strcpy(temp_arr, codes[start]);

  while (j < n && arr[j] < key) {
    arr[j - 1] = arr[j];
    /* codes should also move according to map values */
    strcpy(codes[j - 1 + start], codes[j + start]);
    j = j + 1;
  }
  arr[j - 1] = key;
  strcpy(codes[j - 1 + start], temp_arr);
}
sweetesh
1

I expect there's no way to check how compressible something is until you try to compress it. You could check for patterns (more patterns, perhaps more compressible), but then a particular compression algorithm may not use the patterns you checked for - and may do better than you expect. Another trick may be to take the first 128000 bytes of data, push it through Deflate/Java compression, and see if it's less than the original size. If so - chances are it's worthwhile compressing the entire lot.
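
For example, such a prefix test with java.util.zip.Deflater might look like this (a sketch; the prefix size and the 90% threshold are arbitrary):

// Compress the first part of the data and check whether the output is
// noticeably smaller. Thresholds here are arbitrary examples.
static boolean prefixCompressesWell(byte[] data) {
    int n = Math.min(128 * 1024, data.length);
    java.util.zip.Deflater deflater =
            new java.util.zip.Deflater(java.util.zip.Deflater.BEST_SPEED);
    deflater.setInput(data, 0, n);
    deflater.finish();
    byte[] buffer = new byte[8 * 1024];
    long compressed = 0;
    while (!deflater.finished()) {
        compressed += deflater.deflate(buffer);
    }
    deflater.end();
    // worthwhile if the prefix shrank by at least ~10%
    return compressed * 10 < (long) n * 9;
}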

cs94njw
1

Fast compressors such as LZ4 already have built-in checks for data compressibility. They quickly skip the bad segments to concentrate on more interesting ones. To give a proper example, LZ4 on non-compressible data works at almost the RAM speed limit (2 GB/s on my laptop). So there is little room for a detector to be even faster. You can try it for yourself: http://code.google.com/p/lz4/

Cyan
  • I disagree, there is a lot of room for a detector to be faster. LZ4 is similar to LZO, LZF and Snappy, and I already know how fast they are. All those compression algorithms detect uncompressible blocks, but they do that relatively slowly. – Thomas Mueller Sep 25 '11 at 07:26
  • "Slowly" sounds like a harsh over-statement. The best ones (i do not include LZF in the list) already work at RAM speed limit on not compressible data, and that's while doing there job of providing a proper output (mostly a duplicate of the input if it is not compressible). Just remove the output to provide a compression counter stat instead, and this is probably as fast as it can be. – Cyan Sep 25 '11 at 17:45
  • As I wrote, I'm interested in algorithms for Java. Please note I wrote "relatively slowly", not just "slowly": according to my tests, the Java version of the fastest compression algorithm is much slower than my algorithm (that doesn't generate output and doesn't use a real hash table). – Thomas Mueller Oct 09 '11 at 15:23
  • Lz4 is fast but snappy usually beats lz4 on the same incompressible data. – u0b34a0f6ae Feb 16 '12 at 01:50
0

It says on your profile that you're the author of the H2 Database Engine, a database written in Java.

If I am guessing correctly, you are looking to engineer this database engine to automatically compress BLOB data, if possible.

But -- (I am guessing) you have realized that not everything will compress, and speed is important -- so you don't want to waste a microsecond more than is necessary when determining if you should compress data...

My question is engineering in nature -- why do all this? Basically, isn't it second-guessing the intent of the database user / application developer -- at the expense of speed?

Wouldn't you think that an application developer (who is writing data to the blob fields in the first place) would be the best person to make the decision if data should be compressed or not, and if so -- to choose the appropriate compression method?

The only possible place I can see automatic database compression possibly adding some value is in text/varchar fields -- and only if they're beyond a certain length -- but even so, that option might be better decided by the application developer... I might even go so far as to allow the application developer a compression plug-in, if so... That way they can make their own decisions for their own data...

If my assumptions about what you are trying to do were wrong -- then I humbly apologize for saying what I said... (It's just one insignificant user's opinion.)

Peter Sherman
  • This feature is actually something I'm looking at for Jackrabbit 3 and not for H2. It's not at the expense of speed (that's the plan). Doing a few simple in-memory calculations is faster than storing data to disk. And if the data _can_ be compressed a lot, then compressing + storing the compressed file can be faster than just storing the uncompressed file. – Thomas Mueller Aug 11 '11 at 15:08
  • Why not compress data in the background? If there are CPU cycles available, you could have a thread which looks for compressible objects, attempts to compress them, if it's successful, writes them back as compressed, if not, skips them and moves on... Thread checks CPU state and hibernates if the CPU is busy... ? – Peter Sherman Aug 12 '11 at 02:41
  • Compressing in the background is a good idea, but it's much more complicated, specially because the data store is supposed to be immutable. Replacing the data with the compressed version is tricky because the data store could concurrently be accessed from within another process. Also, the overall throughput would be lower if data needs to be stored twice. – Thomas Mueller Aug 13 '11 at 09:46
0

Also -- Why not try lzop? I can personally vouch for the fact that it's faster, much faster (compression and decompression) than bzip, gzip, zip, rar...

http://www.lzop.org

Using it for disk image compression makes the process disk-IO bound. Using any of the other compressors makes the process CPU-bound (i.e., the other compressors use all available CPU, lzop (on a reasonable CPU) can handle data at the same speed a 7200 RPM stock hard drive can dish it out...)

I'll bet if you tested it with the first X bytes of a 'test compression' string, it would be much faster than most other methods...

Peter Sherman
  • Do you know LZF and Snappy? They are in the same category as LZO. Could you point me to an open source, pure Java implementation of LZO _compression_ algorithm (the one available isn't open source)? – Thomas Mueller Aug 12 '11 at 06:05
  • I found a version of LZO, but it's written in C, and converted to Java using a special preprocessor / converter. I also found some benchmark results that indicate LZF (pure Java, source code available) and Snappy (native!) are about as fast as LZO. – Thomas Mueller Aug 13 '11 at 09:49