
What is the proper way to differentiate between a plain compressed file in gzip or bzip2 format (e.g. .gz) and a tarball compressed with gzip or bzip2 (e.g. .tar.gz)? Identification by suffix is not reliable, as files may end up renamed.

Now on the command line I am able to do something like this:

bzip2 -dc test.tar.bz2 |head|file -

So I attempted something similar in python with the following function:

def get_magic(self, store_file, buffer=False, look_deeper=False):
    """Identify the MIME type of a file path or buffer with python-magic."""
    # see what we're indexing
    if look_deeper:
        m = magic.Magic(mime=True, uncompress=True)
    else:
        m = magic.Magic(mime=True)

    if buffer:
        file_type = m.from_buffer(store_file)
    else:
        file_type = m.from_file(store_file)

    return file_type

Then when trying to read a compressed tarball I'll pass in the buffer from elsewhere via:

    file_buffer = open(file_name, "rb").read(8096)
    archive_check = self.get_magic(file_buffer, True, True)

Unfortunately this becomes problematic with the uncompress flag, because python-magic appears to expect the entire file even though I only want it to examine the buffer. I end up with the exception:

bzip2 ERROR: Compressed file ends unexpectedly

Seeing as the files I'm looking at can be 2 MB to 20 GB in size, this becomes rather problematic. I don't want to read the entire file.

Can I hack around it by chopping the end off the compressed file and appending it to the buffer? Or is it better to skip python-magic's uncompress support and decompress the data myself before passing in a buffer to identify, via something like:

    file_buffer = bz2.BZ2File(file_name).read(8096)
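For reference, here is a self-contained sketch of that "decompress first, then identify" idea using only the standard library (the in-memory test archive is purely for demonstration; with a real file you would pass its path to bz2.BZ2File):

```python
import bz2
import io
import tarfile

# Build a small .tar.bz2 in memory so the sketch is self-contained.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:bz2") as tf:
    info = tarfile.TarInfo("test.txt")
    data = b"some contents\n"
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

# bz2.BZ2File decompresses lazily, so .read(8096) only consumes as much
# of the compressed stream as it needs (rounded up to whole bzip2 blocks).
with bz2.BZ2File(io.BytesIO(raw.getvalue())) as fh:
    file_buffer = fh.read(8096)

# A tar header carries "ustar" at offset 257 of the decompressed data.
print(file_buffer[257:262])  # → b'ustar'
```

The decompressed buffer could then go to m.from_buffer(), which would identify the payload rather than the bzip2 wrapper.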

Is there a better way?

dcmbrown

2 Answers


It is very likely a tar file if the uncompressed data at offset 257 is "ustar", or if the uncompressed data in its entirety is 1024 zero bytes (an empty tar file).

You can read just the first 1024 bytes of the uncompressed data using z = zlib.decompressobj() or z = bz2.BZ2Decompressor(), and z.decompress().
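A sketch of that check, assuming Python 3's standard library (the helper names looks_like_tar and peek_decompressed are illustrative, and the in-memory archive exists only to make the example runnable):

```python
import bz2
import io
import tarfile
import zlib

def looks_like_tar(header):
    """True if the first 1024 decompressed bytes look like a tar file."""
    # "ustar" magic at offset 257, or 1024 zero bytes (an empty tar).
    return header[257:262] == b"ustar" or header[:1024] == b"\0" * 1024

def peek_decompressed(data, fmt):
    """Decompress at most 1024 bytes from the start of `data`."""
    if fmt == "bz2":
        # Note: bzip2 decompresses whole blocks, so `data` must contain
        # at least one complete compressed block to produce any output.
        return bz2.BZ2Decompressor().decompress(data, 1024)
    # 32 + MAX_WBITS tells zlib to auto-detect gzip or zlib headers.
    return zlib.decompressobj(32 + zlib.MAX_WBITS).decompress(data, 1024)

# Build a tiny .tar.bz2 in memory for demonstration.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:bz2") as tf:
    info = tarfile.TarInfo("hello.txt")
    payload = b"hello world\n"
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

header = peek_decompressed(raw.getvalue(), "bz2")
print(looks_like_tar(header))  # → True
```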

Mark Adler

I'm actually going to mark Mark's answer as the correct one as it gave me the hint.

I ended up shelving the project for a good six months to do other things, and was stumped because bz2.BZ2Decompressor didn't seem to do what it was supposed to. It turns out the problem isn't solvable by feeding in only 1024 bytes of the compressed file.

#!/usr/bin/env python

import bz2
import magic

store_file = "10mb_test_file.tar.bz2"
m = magic.Magic(mime=True)

with open(store_file, "rb") as f:
    file_buffer = f.read(1000000)

decompressor = bz2.BZ2Decompressor()
print("encapsulating bz2")
print(type(file_buffer))
print(len(file_buffer))
file_type = m.from_buffer(file_buffer)
print("file type: %s :" % file_type)

# decompress() returns bytes; keep the accumulator as bytes too
buffer_chunk = decompressor.decompress(file_buffer)
print("compressed file contents")
print(type(buffer_chunk))
print(len(buffer_chunk))

file_type = m.from_buffer(buffer_chunk)
print("file type: %s :" % file_type)

Strangely, with a 20 MB tar.bz2 file I can use a value of 200,000 bytes rather than 1,000,000 bytes, but that value won't work on the 10 MB test file. I don't know whether it is specific to the tar.bz2 archive involved, and I haven't looked into the algorithms to see whether they break at specific points, but reading roughly 10 MB of data has so far worked on every archive up to 5 GB. An open().read(buffer) reads up to the buffer size or EOF, so this is okay.
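The size sensitivity here is plausibly down to bzip2's block structure: the decompressor can only emit output once it has been fed a complete compressed block (each block holds 100 kB to 900 kB of uncompressed data, depending on the compression level), so the amount of compressed input needed before any output appears varies from archive to archive. Rather than guessing a fixed read size, the decompressor can be fed incrementally until enough output accumulates. A sketch (peek_bz2 is an illustrative name; the in-memory archive just makes it runnable):

```python
import bz2
import io
import tarfile

def peek_bz2(fileobj, want=1024, chunk_size=64 * 1024):
    """Feed compressed chunks to the decompressor until `want` bytes
    of decompressed output are available (or the stream ends)."""
    decompressor = bz2.BZ2Decompressor()
    out = b""
    while len(out) < want:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break  # EOF before `want` bytes: short stream
        out += decompressor.decompress(chunk)
        if decompressor.eof:
            break
    return out[:want]

# Build a small demo archive in memory.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:bz2") as tf:
    info = tarfile.TarInfo("demo.txt")
    data = b"x" * 4096
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

header = peek_bz2(io.BytesIO(raw.getvalue()))
print(header[257:262])  # → b'ustar'
```

With a real archive you would pass an open file object (open(store_file, "rb")), so only a little more than one compressed block is ever read from disk, regardless of archive size.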

dcmbrown