2

I have some tar.gz files that total many gigabytes on a CentOS system. Most of the tar.gz files are actually pretty small, but the ones containing images are large: one is 7.7G, another is about 4G, and a couple are around 1G.

I have unpacked the files once already and now I want a second copy of all those files.

I assumed that copying the unpacked files would be faster than re-unpacking them. But I started running cp -R about 10 minutes ago, and so far less than 500M has been copied. I feel certain that the unpacking process was faster.

Am I right?

And if so, why? It doesn't seem to make sense that unpacking would be faster than simply duplicating existing structures.
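
For reference, this is roughly how I am comparing the two approaches (the paths and archive name below are placeholders, not my actual directories):

    # Time a plain recursive copy of the already-unpacked tree
    time cp -R /data/unpacked /data/copy-a

    # Time unpacking the archive again into a second directory
    mkdir -p /data/copy-b
    time tar -xzf /data/images.tar.gz -C /data/copy-b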

Buttle Butkus
  • Do you have the *.gz files in addition to the uncompressed files? Are you making a copy to the same disk? different disk? another system? Is this a one-time affair, i.e. now that you have copied the unpacked files once, you simply want to make a copy locally and that is it? – mdpc Oct 18 '12 at 21:25
  • I have the .gz files. I've already unpacked them once to this server. Now I want a second copy of those files on the same server, same disk. So the idea is that I have a choice - whether to copy what I just unpacked or to unpack from the .gz files again. I just assumed that unpacking would be slower, but to my surprise it seems that copying is MUCH slower. – Buttle Butkus Oct 23 '12 at 22:22

2 Answers

9

Consider the two scenarios:

  • Copying requires that you read the full uncompressed data from disk and write it back to disk.
  • Un-tar-gzipping requires that you read a much smaller file from disk, decompress it, and write the result to disk.

If your CPU is not being taxed by the decompression process, it stands to reason that the I/O operations are the limiting factor. By that argument (and since you have to write the same amount in both cases), reading a smaller file (the tar.gz) takes less time than reading the full uncompressed data. Time is also saved because it is faster to read a single file than to read many small files.
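
As a rough way to see how much less data the decompression path has to read, gzip itself can report the compressed and uncompressed sizes (the archive name here is a placeholder; note that for archives larger than 4G the uncompressed size gzip reports wraps around, so treat that figure as approximate for the biggest files):

    # Compare the compressed size (what tar has to read) with the
    # uncompressed size (roughly what cp has to read back out)
    gzip -l /data/images.tar.gz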

The time saved depends on the difference between the time taken to read (I/O) and the time taken to decompress (CPU). Therefore, for files which are minimally compressible (e.g. already-compressed formats such as mp3, jpg, zip, etc.), where the time required for decompression is likely to be greater than the time saved on the read operation, it will in fact be slower to decompress than to copy.
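
If you want to see where the time actually goes, one rough check is to time the read and the decompression separately (the archive name is a placeholder; run each against a cold cache, otherwise the second command will largely read from RAM):

    # Read the archive without decompressing: approximately the I/O cost
    time cat /data/images.tar.gz > /dev/null

    # Read and decompress, discarding the output: I/O plus CPU cost
    time gzip -dc /data/images.tar.gz > /dev/null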

(It is worth noting that the slower the I/O, the more time will be saved by using the compressed file - one such scenario would be if the source and target of the copy operation are on the same physical disk.)
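
(If you want to confirm that the source and target of the copy really do share a disk, something along these lines will show which filesystem each path lives on - the paths are placeholders:)

    # If both paths report the same device, every read competes with
    # every write for the same spindle during the copy
    df -P /data/unpacked /data/copy-a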

cyberx86
  • Yes, they are on the same physical disk. Good explanation. – Buttle Butkus Oct 18 '12 at 21:47
  • 3
    +1 but... Some file types do not compress well, or at all, and the above would not apply in such a case, where a copy can in fact be faster. Unusual to be sure but it does occur. – John Gardeniers Oct 19 '12 at 00:00
  • Absolutely - I'll add it to the answer. – cyberx86 Oct 19 '12 at 11:56
  • 2
    cp also has to deal with filesystem overhead to read attributes and get directory listings - which for a large group of small files can be significant. tar is just reading from a single file so doesn't have to worry about that. – Grant Oct 19 '12 at 12:32
  • @JohnGardeniers good point. But in my particular case the question is not about the compression/decompression vs. copy, but just decompression. In my case though, I think compression/decompression would be faster than copy. Copy was VERY slow, 85 minutes. I think decompression was more like 30 but I have to check. – Buttle Butkus Oct 22 '12 at 00:41
1

Reading a single, much smaller file is much faster than reading a bunch of larger files. This is generally true even if the CPU then has to decompress it.

Michael Hampton