-1

When I calculate the entropy values of files compressed with Gzip, PKZIP, 7-Zip and WinRAR, I find that the compression rate of Gzip is higher than the others: the entropy value is higher (indicating less redundancy) and the file size is smaller. Even for small files, the overhead of Gzip is lower compared to the other tools. To be fair, this is not the case for all file formats; for xlsx, for example, 7-Zip and PKZIP give better results than Gzip and WinRAR. But still, I'm quite surprised, because 7-Zip is generally considered the better compressor in the sense that it reduces the file size more, yet that does not really correspond with my results. Or did I do something completely wrong?

I did not base these results on just a few files: I compressed a whole batch of files of different formats and calculated the deltas of the file sizes with Python.
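A minimal sketch of that kind of comparison could look like the following (not the exact script; Python's standard library has no 7-Zip or RAR bindings, so gzip, bz2 and lzma stand in for the external tools, and the samples directory is a placeholder):

```python
# Sketch: compare how small different compressors make each file.
# gzip, bz2 and lzma from the standard library are used as stand-ins.
import bz2
import gzip
import lzma
from pathlib import Path

def compressed_sizes(path: Path) -> dict[str, int]:
    data = path.read_bytes()
    return {
        "original": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "lzma": len(lzma.compress(data)),
    }

for f in Path("samples").glob("*"):   # hypothetical sample directory
    if not f.is_file():
        continue
    print(f.name, compressed_sizes(f))
```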

What I also find quite interesting: when I look at PDF files, I would expect that especially PDF 1.5 or higher can hardly be compressed further by a lossless compression tool, as such files are already heavily compressed internally. But I don't see much difference between PDFs below version 1.5 and those at version 1.5 or higher; both are compressed quite heavily by these compression tools.

By the way, I used the default algorithms and settings of these archivers.

Can someone explain how/why this is the case (maybe I'm doing something wrong), or do these results actually make sense (I can't find anything on the internet that supports this)?

  • You seem to have not actually asked a programming-related question, or even a question here? – Stu Jul 05 '23 at 16:12
  • Yes, to be honest I didn't know which site to post this on. I will edit the post to make my question clearer, or do I need to post it on https://cs.stackexchange.com/ or something else? – Questions123 Jul 05 '23 at 16:29
  • Probably this is more applicable to Superuser – Stu Jul 05 '23 at 16:37
  • Cross-posted: https://stackoverflow.com/q/76622167/781723, https://cs.stackexchange.com/q/160999/755. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068). If you discover you have posted on the wrong site, you can delete the question and post it on another site (but it is usually best to avoid this if you have already received answers; perhaps you might consider flagging for moderator attention, to ask them to migrate it). Make sure to read the help page of the site to figure out what is on-topic before posting or asking for migration. – D.W. Jul 05 '23 at 19:11

3 Answers

2

"The entropy value is higher (indicating less redundancy) ...". The entropy is relative to a model of the data. If you are using zeroth-order entropy, that can only provide an indication that the data has been compressed (or encrypted), and appears to be random. If the result is close to the number of bits you are measuring, which I'm sure it is in this case, then it can't be used to compare the effectiveness of compression.

"... and the file size is smaller." That's the only way to compare the effectiveness of compression.

The tools you mention, except for gzip, all have several different compression methods they can employ. For each of them (including gzip), you can also specify a compression level, i.e. how hard it works at it. If you're going to attempt to benchmark compression methods, you need to at least say what they were and what parameters were given to them.
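As a small illustration of how much the level setting alone can matter, here is a sketch using Python's zlib (the same DEFLATE algorithm gzip uses); the input file is a placeholder:

```python
# Same algorithm (DEFLATE via zlib), different effort levels:
# the output size can differ noticeably between level 1 and level 9.
import zlib

data = open("example.xml", "rb").read()   # hypothetical test file
for level in (1, 6, 9):                   # 6 is the typical default
    print(level, len(zlib.compress(data, level)))
```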

Though you don't need to bother. There are many benchmarks that have already been done for you. Google "compression benchmark".

Mark Adler
  • Thanks! And I probably indeed did not use the correct parameters etc. One last question: "If the result is close to the number of bits you are measuring, which I'm sure it is in this case, then it can't be used to compare the effectiveness of compression." Could you please explain this phrase a bit more? Thank you – Questions123 Jul 05 '23 at 19:16
  • 1
    If I measure the zeroth-order entropy of any large compressed file I have, I get values like 7.999296, 7.989287, or 7.999937 bits per byte. Those are all close enough to 8 bits per byte, that they are indistinguishable from random data. – Mark Adler Jul 05 '23 at 20:40
1

We can't answer whether you've done anything wrong without seeing the specifics of your experimental methodology and your results, but here are some general remarks:

The compression rate is likely to depend on what types of files you are compressing. It is common for one compression algorithm to do better than another on some types of files and worse on others.

Also, there is a tradeoff between compression rate and computation time. Some compressors "try harder" to compress, i.e., they are willing to spend more computation time in exchange for hopefully better compression.
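To make that time/size tradeoff concrete, a quick sketch along these lines (using Python's lzma presets on a placeholder file) prints both numbers side by side:

```python
# Rough sketch of the time-vs-size tradeoff: higher presets usually
# produce smaller output but take noticeably longer.
import lzma
import time

data = open("big_input.bin", "rb").read()   # hypothetical test file
for preset in (0, 6, 9):
    start = time.perf_counter()
    out = lzma.compress(data, preset=preset)
    print(preset, len(out), round(time.perf_counter() - start, 3), "s")
```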

Finally, some compression algorithms are "just better" in that they're likely to perform better across the board on many types of files.

There might be a misunderstanding. Compression rate is defined as the size of the original file divided by the size of the compressed file. The entropy of the compressed file does not affect the compression rate.
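For example, a 100 MB file that compresses to 25 MB has a compression rate of 100/25 = 4 (a 4:1 ratio, or 75% space saving), no matter what the entropy of the compressed output happens to be.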

D.W.
  • Thank you sir, was a helpful post! – Questions123 Jul 05 '23 at 19:15
  • May I ask one more question: if I compress a file with 7zip and the output is gzip (so I use the compressor 7zip but the algorithm gzip), is that the same as if I had created a gzip output in a different way (assuming the settings are the same)? – Questions123 Jul 05 '23 at 20:03
  • @Questions123, this site isn't intended or designed for back-and-forths or follow-up questions. If you have an additional question, figure out whether it is on-topic here (hint: it is probably not), then if it is, post it using 'Ask Question', with full context and showing your research. – D.W. Jul 06 '23 at 06:18
0

The tools you compare are file archivers - with the exception of gzip.

File archivers are used to handle more than one file and/or one or more hierarchies of files. Usually the file formats keep metadata about the individual files, and the tools allow operating on any given individual file with much less effort than handling the entire archive.

gzip has been used to compress and decompress archives produced by (then) non-compressing archivers such as tar (tape archiver, late 1970s) or pax: solid compression. The gzip metadata can be as low as 18 bytes - for the entire file.
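For instance, a solid .tar.gz can be produced from Python in a couple of lines (the output name and directory are placeholders); the whole uncompressed tar stream is compressed as one unit, unlike archivers that compress each member separately:

```python
# Sketch: a .tar.gz is one gzip stream over the whole tar archive
# ("solid" compression), not per-file compression.
import tarfile

with tarfile.open("backup.tar.gz", "w:gz") as tar:   # hypothetical output
    tar.add("samples")                                # hypothetical directory
```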

greybeard
  • To split a hair, *gunzip* was specified to create exactly *one* output. Multiple files specified for gzip got compressed individually and concatenated in the gzip output (& decompressed individually and concatenated in gunzip's output). – greybeard Jul 06 '23 at 04:00