
I'm working on converting my backup script from shell to Python. One of the features of my old script was to check the created tarfile for integrity by doing: gzip -t .

This seems to be a bit tricky in Python.

It seems that the only way to do this is by reading each of the compressed TarInfo objects within the tarfile.

Is there a way to check a tarfile for integrity without extracting it to disk or keeping it in memory in its entirety?

Good people on #python on freenode suggested that I should read each TarInfo object chunk-by-chunk, discarding each chunk read.

I must admit that I have no idea how to do this, seeing that I just started Python.

Imagine that I have a tarfile of 30GB which contains files ranging from 1kb to 10GB...

This is the solution that I started writing:

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

This code is far from finished. I would not dare run this on a huge 30GB tar archive because, at some point, check would be an object of 10+GB (if I have such huge files within the tar archive).

Bonus: I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes midfile). The first except does not catch IOError... Here is the output:

Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L
Kaurin
  • I've tried the tarfile module with a LOT of files. The thing is that tarfile.TarFile stores all read (or written) members in its "members" list, so it will take a lot of memory when you read a tarbomb with lots and lots of small files. – tdihp Feb 10 '14 at 03:05

3 Answers


Just a minor improvement on Aya's answer to make things a little more idiomatic (although I'm removing some of the error checking to make the mechanics more visible):

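import tarfile

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        if not member.isfile():
            continue  # extractfile() returns None for directories
        target = tardude.extractfile(member)
        # Read and discard each chunk until read() returns b'' at end of data.
        for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
            pass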

This really just removes the while 1: (sometimes considered a minor code smell) and the if not data: check. Also note that the use of with restricts this to Python 2.7+
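For anyone unfamiliar with the two-argument form of iter() used above, here is a small sketch of how it behaves. It uses an in-memory io.BytesIO as a stand-in for the object returned by extractfile() (that substitution and the tiny BLOCK_SIZE are my own assumptions, chosen only to make the chunking visible), and it is written for Python 3.

```python
import io

BLOCK_SIZE = 4  # deliberately tiny so the chunking is visible

# Stand-in for the object returned by extractfile(); any binary
# file-like object with a read() method behaves the same way here.
target = io.BytesIO(b"0123456789")

# iter(callable, sentinel) calls target.read(BLOCK_SIZE) repeatedly and
# stops as soon as the call returns the sentinel b"" (end of data).
chunks = list(iter(lambda: target.read(BLOCK_SIZE), b""))
print(chunks)  # [b'0123', b'4567', b'89']
```

In the integrity check the chunks are simply thrown away instead of being collected into a list, so only one chunk is alive at a time.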

ZachP

I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes midfile). The first except does not catch IOError...

If you look at the traceback, you'll see it's being thrown when you call tardude.getmembers(), so you'll need something like...

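import sys
import tarfile
import zlib

# Errors a damaged gzip stream or tar structure can raise while reading.
READ_ERRORS = (IOError, EOFError, zlib.error, tarfile.TarError)

try:
    tardude = tarfile.open("zero.tar.gz")
except READ_ERRORS:
    sys.exit("There was an error opening tarfile. The file might be corrupt or missing.")

try:
    members = tardude.getmembers()
except READ_ERRORS:
    sys.exit("There was an error reading tarfile members.")

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
    except READ_ERRORS:
        print("File: %r is corrupt." % member_info.name)

tardude.close()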

As for the original problem, you're almost there. You just need to read the data from your check object with something like...

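import sys
import tarfile
import zlib

BLOCK_SIZE = 1024
# Errors a damaged gzip stream or tar structure can raise while reading.
READ_ERRORS = (IOError, EOFError, zlib.error, tarfile.TarError)

try:
    tardude = tarfile.open("zero.tar.gz")
except READ_ERRORS:
    sys.exit("There was an error opening tarfile. The file might be corrupt or missing.")

try:
    members = tardude.getmembers()
except READ_ERRORS:
    sys.exit("There was an error reading tarfile members.")

for member_info in members:
    if not member_info.isfile():
        continue  # extractfile() returns None for directories
    try:
        check = tardude.extractfile(member_info.name)
        # Read and discard the member's data one block at a time.
        while 1:
            data = check.read(BLOCK_SIZE)
            if not data:
                break
    except READ_ERRORS:
        print("File: %r is corrupt." % member_info.name)

tardude.close()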

...which should ensure you never hold more than BLOCK_SIZE bytes of a member's data in memory at a time.
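If it helps, the same read-and-discard loop can be wrapped in a function. This is only a sketch for Python 3: the name tar_is_readable, the True/False return convention and the tuple of caught exceptions are my own assumptions, and the tuple is a best guess rather than an exhaustive list.

```python
import tarfile
import zlib

# Errors that a damaged gzip stream or tar structure can raise while
# reading. OSError covers IOError on Python 3. Best guess, not exhaustive.
READ_ERRORS = (OSError, EOFError, zlib.error, tarfile.TarError)


def tar_is_readable(path, block_size=1024 * 1024):
    """Return True if every regular file in the archive reads to the end.

    Hypothetical helper: the name and return convention are illustrative.
    """
    try:
        with tarfile.open(path) as archive:
            for member in archive:
                if not member.isfile():
                    continue  # directories, links etc. have no data to read
                extracted = archive.extractfile(member)
                # Read and discard; at most block_size bytes of member
                # data are held at once.
                while extracted.read(block_size):
                    pass
    except READ_ERRORS:
        return False
    return True
```

Calling tar_is_readable("zero.tar.gz") returns False for a missing file as well, since that also surfaces as an OSError. A True result only means that every member could be read without an exception, so treat it as a sanity check rather than a guarantee that the archive is intact.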

Also, you should try to avoid using...

try:
    do_something()
except:
    do_something_else()

...because it will mask unexpected exceptions. Try to only catch the exception you actually intend to handle, like...

try:
    do_something()
except IOError:
    do_something_else()

...otherwise you'll find it more difficult to detect bugs in your code.

Aya
  • O great! Regarding the "except:" stuff... I know about that... I usually have "except this:" "except that:"... "except:", but this was just for testing :D – Kaurin Apr 15 '13 at 11:33
  • I have done the following: http://pastie.org/7585277. As you can see, there is a check member_info.isfile, because parsing directories always gives an error. I would also like to skip parsing anything but plain files. – Kaurin Apr 15 '13 at 14:51
  • 1
    You'll need to check the `member_info` object inside the for-loop. Something like `if not member_info.isfile(): continue` ought to work. – Aya Apr 15 '13 at 14:58
  • This kinda helped, but also didn't (due to the Except: and tarfile issues, could you please edit them out of the code examples above?): I **still** got bad tars to pass this test - the **only** way to make sure the tar is correct is by calling `tardude.extractall("/some/tmp/dir")` – Badmaster Feb 08 '16 at 15:31

You can use the subprocess module to call gzip -t on the file...

from subprocess import call
import os

with open(os.devnull, 'w') as bb:
    result = call(['gzip', '-t', "zero.tar.gz"], stdout=bb, stderr=bb)

If result is not 0, something is amiss. You might want to check whether gzip is available, though. I wrote a utility function for that:

import subprocess
import sys
import os

def checkfor(args, rv = 0):
    """Make sure that a program necessary for using this script is
    available.

    Arguments:
    args  -- string or list of strings of commands. A single string may
             not contain spaces.
    rv    -- expected return value from invoking the command.
    """
    if isinstance(args, str):
        if ' ' in args:
            raise ValueError('no spaces in single command allowed')
        args = [args]
    try:
        with open(os.devnull, 'w') as bb:
            rc = subprocess.call(args, stdout=bb, stderr=bb)
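        if rc != rv:
            # Give the error a message so that strerror is not None below.
            raise OSError(0, "unexpected return code {}".format(rc))
    except OSError as oops:
        outs = "Required program '{}' not found: {}."
        print(outs.format(args[0], oops.strerror))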
        sys.exit(1)
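For readers who would rather not shell out at all, roughly the same check can be sketched with the standard gzip module: decompress the whole file in chunks and throw the output away, so that the stream's trailing CRC and length get verified when the end is reached. This is a Python 3 sketch, not a drop-in replacement for gzip -t. The name gzip_stream_ok, the return convention and the list of caught exceptions are my assumptions.

```python
import gzip
import zlib


def gzip_stream_ok(path, block_size=1024 * 1024):
    """Decompress the whole file in chunks, discarding the output.

    Hypothetical helper: returns False if the gzip layer raises while
    being read from start to end, True otherwise.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Only one block of decompressed data is held at a time.
            while stream.read(block_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Note that this only exercises the gzip layer. It says nothing about whether the tar structure inside is sensible, so it complements reading the members rather than replacing it.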
Roland Smith
  • Sorry, I forgot to mention that I want to use a pythonic approach, without resorting to subprocess. Thank you for your answer, though! – Kaurin Apr 15 '13 at 11:28