My summer homework is to write a Huffman compression program. I have searched a lot, but I don't know whether it can be used for every file format or only for text files. I think it is possible, but I'd rather ask here.
-
An image file can be a binary file, and the JPEG and MPEG formats use Huffman codes. A Huffman code does not need to be 8 or 16 bits; it can be any number of bits within a bitstream. – Weather Vane Jun 22 '19 at 08:49
-
Build it and try it - the doubt need not stop you from completing your homework. – Clifford Jun 22 '19 at 09:19
-
There's lots of repetitive and redundant information in the typical binary. Such repetition is what makes compression possible. I just took a 3 KiB `.o` file on a Mac and compressed it: `pins31.o: 2.036:1, 3.929 bits/byte, 50.88% saved, 3280 in, 1611 out.` — that's about 50% savings. On a 92 KiB file, `gzip` got 60.1% savings; `bzip2` got 58.71% savings (surprising; it normally does better than `gzip`); `lzip` got 66.93% savings; `xz` got 67.07% savings. And I see I need to upgrade; my `lz`/`xz` binaries were built in 2010. – Jonathan Leffler Jun 23 '19 at 02:55
2 Answers
As far as the mechanics of reading data from an input file and writing data to an output file are concerned, there are no impediments to applying a Huffman encoding algorithm to a binary file. One simply reads bytes, operates on them, and writes bytes.
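For example, here is a minimal sketch of that byte-level I/O, assuming C and hypothetical file names; the actual Huffman step is only marked as a placeholder:

```c
#include <stdio.h>

/* Minimal sketch: copy bytes from a binary input file to an output file.
 * The "operate on them" step (Huffman encoding) would go where indicated.
 * File names are placeholders for illustration. */
int main(void)
{
    FILE *in  = fopen("input.bin",  "rb");   /* "rb": read raw bytes   */
    FILE *out = fopen("output.huf", "wb");   /* "wb": write raw bytes  */
    if (in == NULL || out == NULL)
        return 1;

    int c;
    while ((c = fgetc(in)) != EOF) {
        /* A real encoder would emit the Huffman code for byte c here;
         * this sketch just copies the byte through unchanged. */
        fputc(c, out);
    }

    fclose(in);
    fclose(out);
    return 0;
}
```

Opening both files in binary mode ("rb"/"wb") is the only precaution needed; the same loop works whether the input is text, an image, or an executable.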
As far as whether a Huffman encoding algorithm will make a binary file smaller, there are issues about information content and probability distributions. Any compression scheme attempts to reduce the data used by taking advantage of patterns in the data. For example, when there are repeated sequences of bytes, they may be replaced by shorter codes that represent them.
Text files are generally very compressible because natural human language is not arbitrary data but uses a limited set of characters, has many patterns in the characters, and has many repeated parts. “Binary files” can be anything. Much of the data we store in binary files does have patterns and is compressible to some extent, but some of the data may be very dense in information content and not have patterns usable by a compression algorithm.
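For instance, the byte-frequency histogram that a Huffman encoder is built from can be computed for any file, text or binary alike; a rough sketch (the file name is a placeholder):

```c
#include <stdio.h>

/* Sketch: count how often each of the 256 possible byte values occurs.
 * These counts are the input to Huffman tree construction, and their
 * skew is what determines how compressible the data is. */
int main(void)
{
    unsigned long freq[256] = {0};
    FILE *in = fopen("input.bin", "rb");   /* placeholder file name */
    if (in == NULL)
        return 1;

    int c;
    while ((c = fgetc(in)) != EOF)
        freq[c]++;
    fclose(in);

    for (int i = 0; i < 256; i++)
        if (freq[i] != 0)
            printf("byte 0x%02X: %lu\n", i, freq[i]);
    return 0;
}
```

A plain-text file typically leaves most of the 256 counters at zero and concentrates the rest on a few letters, while dense binary data (already-compressed or encrypted content, for example) tends to use all 256 values roughly evenly.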
It is impossible for any lossless compression algorithm to compress every file. If a compression algorithm always produced a smaller file, we could run it again on the smaller file to get an even smaller file, and repeating that would eventually reduce the file size to zero.
So any compression algorithm must fail to make some files shorter. In fact, since there are a fixed number of files of a given length and smaller, if it makes any files smaller, it must make some files larger.
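To make the counting concrete: there are 2^n distinct files of exactly n bits, but only 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 files that are strictly shorter, so a lossless compressor cannot map every n-bit file to a shorter one without sending two different inputs to the same output.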

-
A simpler proof that there is no lossless compression algorithm that can compress every file is to consider applying the algorithm recursively to its own output. Eventually you would end up with nothing! – Dipstick Jun 22 '19 at 10:41
A "text file" is just a binary file with a particular interpretation put upon it which software will render in a human readable presentation. The compressability of any content using Huffman encoding depends on the frequency distribution of particular byte values (or other word sizes possibly).
Text files for most languages use a restricted character set and have very uneven frequency distribution, so tend to be very compressible. Other file types will vary with the nature of both the format and the specific content.
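One way to estimate what that distribution allows, assuming byte-by-byte coding, is to compute the Shannon entropy of the byte frequencies; a sketch (the file name is a placeholder, and the figure is a lower bound that a Huffman code approaches rather than an exact prediction):

```c
#include <stdio.h>
#include <math.h>

/* Sketch: estimate compressibility from the byte-value frequency
 * distribution by computing the Shannon entropy in bits per byte.
 * Per-byte Huffman coding cannot beat this bound: 8.0 bits/byte means
 * effectively incompressible, lower means more compressible. */
int main(void)
{
    unsigned long freq[256] = {0};
    unsigned long total = 0;
    FILE *in = fopen("input.bin", "rb");   /* placeholder file name */
    if (in == NULL)
        return 1;

    int c;
    while ((c = fgetc(in)) != EOF) {
        freq[c]++;
        total++;
    }
    fclose(in);

    double entropy = 0.0;
    for (int i = 0; i < 256; i++) {
        if (freq[i] != 0) {
            double p = (double)freq[i] / (double)total;
            entropy -= p * log2(p);
        }
    }
    printf("entropy: %.3f bits/byte\n", entropy);
    return 0;
}
```

Run against a novel in plain text this typically reports around 4-5 bits/byte, while an already-compressed file reports close to 8, which matches the intuition about uneven versus flat distributions.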
