What is the fundamental difference between tarring a folder using tar on Unix and tarfile in Python that results in a different file size?

In the example below, there is an 8.2 MB difference. I'm currently using a Mac. The folder in this example contains a bunch of random text files for testing purposes.

tar -cvf archive_unix.tar files/

python -m tarfile -c archive_pycli.tar files/ # using Python 3.9.6

-rw-r--r--  1 userid  staff  24606720 Oct 15 09:40 archive_pycli.tar
-rw-r--r--  1 userid  staff  16397824 Oct 15 09:39 archive_unix.tar
Simon1
  • The first step would be to run `tar -tvf` on both archives to see what differences might exist between their contents. – jasonharper Oct 15 '21 at 14:03
  • I have already done this and both are identical. I saved the listing of each archive to a file and compared the two: the same number of files exist and all of the file sizes are identical. – Simon1 Oct 15 '21 at 14:25
  • The one possibility that comes to mind is that you're dealing with *sparse files* - files with sufficiently long runs of null bytes that entire disk blocks can be omitted from their storage. Some `tar` implementations preserve sparseness, some don't. However, that's not compatible with your description of these as "random text files", since a text file shouldn't contain null bytes at all. – jasonharper Oct 15 '21 at 14:41

1 Answer

Interesting question. The tarfile documentation (https://docs.python.org/3/library/tarfile.html) says that, since Python 3.8, the default format for archives created by tarfile is PAX_FORMAT, whereas archives created by the tar command use the GNU format. I believe that explains the difference.

To produce one archive in the same format as the tar command and one in the default format (as your command did):

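import tarfile

# Explicit GNU format, matching what the tar command produced.
with tarfile.open('archive-py-gnu.tar', mode='w', format=tarfile.GNU_FORMAT) as tf:
    tf.add('tmp')

# No format argument: uses tarfile.DEFAULT_FORMAT (PAX_FORMAT since Python 3.8).
with tarfile.open('archive-py-default.tar', mode='w') as tf:
    tf.add('tmp')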

For comparison:

$ tar cf archive-tar.tar tmp/
$ ls -l 
3430400 16:28 archive-py-default.tar
3317760 16:28 archive-py-gnu.tar
3317760 16:27 archive-tar.tar
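
The same comparison can be reproduced entirely in Python. The sketch below is only an illustration: the helper names `archive_size` and `compare_formats`, the 25 small throwaway files, and the `files` archive name are all made up for the demo. It writes the same directory once per format and reports the resulting sizes. As far as I can tell, much of the extra size comes from tarfile writing a pax extended header in front of each member (for example to store the floating-point mtime), but treat that as an assumption rather than a full explanation.

```python
import os
import tarfile
import tempfile


def archive_size(src_dir, fmt):
    """Archive src_dir with the given tarfile format and return the size in bytes."""
    with tempfile.TemporaryDirectory() as out_dir:
        path = os.path.join(out_dir, "out.tar")
        with tarfile.open(path, mode="w", format=fmt) as tf:
            # A short, fixed arcname keeps member names identical in both runs.
            tf.add(src_dir, arcname="files")
        return os.path.getsize(path)


def compare_formats(n_files=25):
    """Create n_files small text files and archive them once per format."""
    with tempfile.TemporaryDirectory() as src_dir:
        for i in range(n_files):
            with open(os.path.join(src_dir, f"file{i}.txt"), "w") as fh:
                fh.write("some text\n" * 10)
        return {
            "gnu": archive_size(src_dir, tarfile.GNU_FORMAT),
            "pax": archive_size(src_dir, tarfile.PAX_FORMAT),
        }


if __name__ == "__main__":
    for name, size in compare_formats().items():
        print(f"{name}: {size} bytes")
```

With many small files the per-member overhead dominates, which would be consistent with a folder of small text files showing a large relative difference.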

Results of the file command:

$ file archive_unix.tar
archive_unix.tar: POSIX tar archive (GNU)
$ file archive-py-gnu.tar
archive-py-gnu.tar: POSIX tar archive (GNU)
$ file archive-py-default.tar
archive-py-default.tar: POSIX tar archive
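
The distinction that `file` reports can also be checked by hand. A minimal sketch, assuming the usual header layout in which the magic field starts at byte offset 257 of the first 512-byte block: GNU archives carry `ustar` followed by two spaces and a NUL there, while POSIX (ustar/pax) archives carry `ustar`, a NUL, and the version `00`. The function name `tar_flavour` is hypothetical, and only the first header of an uncompressed archive is inspected.

```python
GNU_MAGIC = b"ustar  \x00"    # "ustar", two spaces, NUL
POSIX_MAGIC = b"ustar\x0000"  # "ustar", NUL, version "00"


def tar_flavour(path):
    """Classify an uncompressed tar archive by the magic field of its first header."""
    with open(path, "rb") as fh:
        block = fh.read(512)
    magic = block[257:265]  # magic (6 bytes) plus version (2 bytes)
    if magic == GNU_MAGIC:
        return "GNU"
    if magic == POSIX_MAGIC:
        return "POSIX"
    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python tar_flavour.py archive1.tar archive2.tar ...
    for name in sys.argv[1:]:
        print(f"{name}: {tar_flavour(name)}")
```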

Unfortunately, I cannot tell you exactly how the formats differ, sorry. But I hope this helps.

qouify
  • I appreciate you taking the time to help me out. Your answer led me to find out that you can change the archive format when using `tar`, so I gave that a try and was able to confirm the size difference was due to different archive formats. – Simon1 Oct 15 '21 at 15:00
  • @Simon1 Glad I could help. I often use `tarfile` but never noticed this behaviour so your question interested me. – qouify Oct 15 '21 at 16:47