
We are using #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.

My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.

Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos, though the type and size of the photos may well vary.

I haven't written the actual file transfer logic yet, but I expect to use System.Net.WebClient for it; I'm open to alternatives that save on execution time as well.

UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.

Michael Kingsmill
    Photos will not be smaller with zip – adrianm Nov 02 '11 at 12:37
  • At least if they are stored in an already-compressed format such as JPEG or PNG. Uncompressed bitmaps/TIFFs, on the other hand, can be compressed a bit. – CodesInChaos Nov 02 '11 at 12:40
  • I think whether compression is useful mainly depends on available CPU power compared to upload bandwidth. Looking at the horrible upload rates consumer internet has in many countries, even small compression ratios could be a win. – CodesInChaos Nov 02 '11 at 12:42
  • Odds are low, especially if it is an HTTP transfer, which is already routinely gzipped. Make it work without compression first; then you can actually test and compare in version 1.1. – Hans Passant Nov 02 '11 at 12:53
  • @adrianm The utility I am developing is actually going into an internal framework to be used by multiple applications. When I said the application is dealing only with photos, that was rather short-sighted. We have other applications that synchronize technical documentation and emergency resources in various text formats that will eventually be using this new model as well. – Michael Kingsmill Nov 02 '11 at 12:54

3 Answers


To determine whether it's useful to compress a file, you have to read the file anyway. While you're at it, you might as well zip it.

If you want to avoid useless zipping without reading the files, you could try to decide beforehand, based on other properties.

You could create an 'algorithm' that decides whether it's useful, for example based on file extension and size. So a .txt file of more than 1 KB can be zipped, but a .jpg file shouldn't be, regardless of its size. It's a lot of work to create such a list, though (alternatively, you could maintain a blacklist or whitelist and deny or allow, respectively, all files not on the list).
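As a rough illustration of such an extension-and-size rule, here is a minimal C# sketch; the extension list and the 1 KB threshold are illustrative assumptions to tune, not fixed recommendations:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch: decide whether to zip based on extension and size alone.
public static class ZipDecision
{
    // Formats whose contents are already compressed and rarely shrink further.
    static readonly HashSet<string> AlreadyCompressed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp3", ".mp4", ".docx", ".xlsx" };

    public static bool ShouldZip(string path, long sizeInBytes)
    {
        if (sizeInBytes < 1024) return false;        // too small to be worth the overhead
        string ext = Path.GetExtension(path);
        return !AlreadyCompressed.Contains(ext);     // skip already-compressed formats
    }
}
```

The list-based approach is cheap (no file reads), but it will misclassify files with unusual extensions, which is exactly the maintenance burden described above.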

CodeCaster

You probably have plenty of CPU time, so the only issue is: does it shrink?

If you can shrink the file, you will save on disk and network I/O. That becomes profitable very quickly.

Alas, photos (JPEG) are already compressed, so you probably won't see much gain.
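One way to check whether a given file is likely to shrink, without committing to zipping the whole thing, is to gzip a small sample of it first. A minimal sketch, assuming a 64 KB sample and a 10% shrinkage threshold (both arbitrary choices):

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Sketch: compress a leading sample of the file and measure how much it shrinks.
public static class CompressionProbe
{
    // Returns compressedSize / originalSize for the given sample; < 1.0 means it shrank.
    public static double SampleRatio(byte[] sample)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(sample, 0, sample.Length);
            return (double)output.Length / sample.Length;
        }
    }

    public static bool WorthZipping(string path)
    {
        var sample = new byte[64 * 1024];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(sample, 0, sample.Length);
        if (read == 0) return false;
        Array.Resize(ref sample, read);
        return SampleRatio(sample) < 0.9;   // zip only if the sample shrank by >10%
    }
}
```

This costs one small read plus one small compression per file, which is usually negligible next to the network transfer it can avoid. JPEG samples will typically come back near a ratio of 1.0, so photos would correctly be skipped.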

H H
  • Network traffic is the big concern here, as we are dealing with people on air cards and phone networks in a lot of instances. Perhaps what should be drawing more attention is the transfer method. Do you have any thoughts on compressed WCF traffic vs. zipping the files at all? – Michael Kingsmill Nov 02 '11 at 13:05

You can write your own fairly simple heuristic analyzer and consult it as each subsequent file is processed. The collected statistics should be persisted so the heuristic stays effective across restarts.

A basic interface:

enum FileContentType
{
  PlainText,
  OfficeDoc,
  OfficeXlsx
}

// Naming could be improved
public interface IHeuristicZipAnalyzer
{
   bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
   void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}

Then you can collect statistics by calling AddInfo(...) for each file you have just zipped, and based on them decide whether it is worth zipping the next file by calling IsWorthToZip(...).
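To make this concrete, here is one possible implementation of that interface, keeping a running average compression ratio per content type; the 0.9 cutoff and the minimum history of 5 samples are assumptions to tune (the enum and interface are repeated so the sketch is self-contained):

```csharp
using System.Collections.Generic;

// Repeated from the answer above for completeness.
public enum FileContentType { PlainText, OfficeDoc, OfficeXlsx }

public interface IHeuristicZipAnalyzer
{
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}

// Tracks the average zipped/original ratio per content type and only
// recommends zipping when history says it pays off.
public class AverageRatioZipAnalyzer : IHeuristicZipAnalyzer
{
    class Stats { public int Count; public double RatioSum; }
    readonly Dictionary<FileContentType, Stats> stats =
        new Dictionary<FileContentType, Stats>();

    public bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType)
    {
        Stats s;
        if (!stats.TryGetValue(contentType, out s) || s.Count < 5)
            return true;                          // not enough history yet: try zipping
        return (s.RatioSum / s.Count) < 0.9;      // zip only if it historically shrank >10%
    }

    public void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize)
    {
        Stats s;
        if (!stats.TryGetValue(contentType, out s))
            stats[contentType] = s = new Stats();
        s.Count++;
        s.RatioSum += (double)finalZipSize / fileSizeInBytes;
    }
}
```

Persisting the dictionary (for example, serialized to a small local file) would preserve the learned ratios between restarts, as suggested above.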

sll