I'm trying to unzip a zipfile (compressed with BZ2) into a directory. The zipfile contains multiple files.

All of the examples I've found (and I've seen quite a few already) show how to decompress the zipfile into a single file.

This is what I have so far:

def unzipBzip2(passed_targetDir, passed_zipfile):
    full_zipfile = pathlib.Path(constants.APP.ROOT, constants.DOWNLOAD_FOLDER, passed_zipfile)
    full_target = pathlib.Path(constants.APP.ROOT, constants.DOWNLOAD_FOLDER, passed_targetDir)
    
    with open(file=full_zipfile, mode="rb") as zipfile, open(full_target, 'wb') as target:
        decompressor = bz2.BZ2Decompressor()

        for data in iter(lambda : zipfile.read(100*1024), b''):
            target.write(decompressor.decompress(data))

    return

Error is:

Traceback (most recent call last):
  ... (stack) ...
  File "/Users/bert/Project/unzipBzip2.py", line 26, in unzipBzip2
    with open(file=fullzipfile, mode="rb") as zipfile, open(full_target, 'wb') as target:
IsADirectoryError: [Errno 21] Is a directory: '/Users/bert/Project/data/51fba56e-c598-491a-a5e4-57373a59367a'

Well, "/Users/bert/Project/data/51fba56e-c598-491a-a5e4-57373a59367a" is indeed a directory. And that's what it should be, since the unzipped files (from the BZ2 zipfile) should be written in that directory.

Why does decompressor complain that this is a directory?

If I change the target to a file

    full_target = pathlib.Path(constants.APP.ROOT, constants.DOWNLOAD_FOLDER, passed_targetDir, 'x.x')

it gives the following error:

  File "/Users/bert/Project/unzipBzip2.py", line 30, in unzipBzip2
    target.write(decompressor.decompress(data))
OSError: Invalid data stream
martineau
BertC
  • I think you are confusing zip archives, which contain one or more member files, with [BZ2](https://docs.python.org/3/library/bz2.html), which is just a way to compress a single file; it's not a container of other files like the former. – martineau Oct 26 '21 at 12:45
  • what's the extension of your zipfile? `tar.bz2` ? – emptyhua Oct 26 '21 at 13:48
  • @emptyhua, don't mind the extension. It's *.bzip2.zip. That confused me. It now appears to be a 7z zipfile. And that one does have more than one file in it. However the Python package py7zr does not recognise it, while the linux command (7z) does. – BertC Oct 26 '21 at 13:55
  • If you could post a (small) sample file somewhere, I may be able to help write code to recognize and decompress it. – martineau Oct 27 '21 at 19:53

1 Answer

If your file is a Zip archive that was then compressed with bz2, the code below should work.

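import bz2
import pathlib
import zipfile

# `constants` is the asker's own module, imported elsewhere.
def unzipBzip2(passed_targetDir, passed_zipfile):
    full_zipfile = pathlib.Path(constants.APP.ROOT, constants.DOWNLOAD_FOLDER, passed_zipfile)
    full_target = pathlib.Path(constants.APP.ROOT, constants.DOWNLOAD_FOLDER, passed_targetDir)

    # Strip the bz2 layer first, then read the Zip archive inside it.
    with open(file=full_zipfile, mode="rb") as rawf:
        with bz2.BZ2File(rawf) as bz2f:
            with zipfile.ZipFile(bz2f) as zipf:
                # Extract every member into the target directory.
                zipf.extractall(full_target)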

You can use the file command to identify the archive format. For example, suppose your file is abc.unkown.bz2:

$ file ./abc.unkown.bz2
./abc.unkown.bz2: bzip2 compressed data, block size = 900k

Now decompress it with bzip2, which leaves you with abc.unkown:

$ bzip2 -d ./abc.unkown.bz2

Then run file again on the decompressed abc.unkown:

$ file ./abc.unkown
./abc.unkown: Zip archive data, at least v1.0 to extract

So the example file is a Zip archive inside a bz2 stream.
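
The same check can be approximated in Python by looking at the first few bytes of the data, which is roughly what the file command does. This is only a minimal sketch: `MAGIC` and `sniff_format` are hypothetical names, and the table covers just the three formats mentioned in this thread (bzip2, Zip, 7z) using their well-known signatures.

```python
import bz2
import io
import zipfile

# Leading "magic" bytes for the formats discussed here (illustrative table,
# not an exhaustive list).
MAGIC = {
    b"BZh": "bzip2",
    b"PK\x03\x04": "zip",
    b"7z\xbc\xaf\x27\x1c": "7z",
}


def sniff_format(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"
```

Reading the first 8 bytes of the file and passing them to `sniff_format` is enough for these signatures; if the result is "bzip2", decompress and sniff the result again to see what is inside.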

emptyhua
  • Even when there is more than one file in the bz2? Does that mean that bz2 is not only for one-file compression but also for multiple files? – BertC Oct 26 '21 at 14:04
  • bz2 can only compress one file. To support multiple files, that one file is itself an archive in another format, e.g. tar.bz2 is a tar file compressed with bz2. You need to confirm which format is compressed inside your bz2 file. – emptyhua Oct 26 '21 at 14:10
  • The **tar** file format is not "inside" `bz2`; it's yet another (rather ancient) standalone [computer archive file format](https://en.wikipedia.org/wiki/Tar_(computing)), and a tar file can be `bz2` compressed like any other single file. – martineau Oct 27 '21 at 19:49
  • @martineau I updated my answer; it may help to identify the archive format. – emptyhua Oct 27 '21 at 23:13
  • That's definitely an improvement. The OP will still need to determine the type of data in the uncompressed file *programmatically* to determine whether to use the [`zipfile`](https://docs.python.org/3/library/zipfile.html#module-zipfile) or [`tarfile`](https://docs.python.org/3/library/tarfile.html#module-tarfile) module in the standard library to extract the files from a Zip archive or tarball, as appropriate. – martineau Oct 27 '21 at 23:24
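
The programmatic check described in the last comment could look roughly like the sketch below. It assumes the decompressed payload is small enough to hold in memory, and `extract_bz2_archive` is a hypothetical helper name, not something from the answer. It uses only `zipfile.is_zipfile` and `tarfile.open` from the standard library to decide which module to use.

```python
import bz2
import io
import pathlib
import tarfile
import zipfile


def extract_bz2_archive(archive_path, target_dir):
    """Decompress a bz2 file and extract the Zip or tar archive inside it.

    Returns the detected inner format: "zip" or "tar".
    """
    target = pathlib.Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Assumption: the decompressed data fits in memory.
    with bz2.open(archive_path, "rb") as f:
        payload = io.BytesIO(f.read())

    # Zip archives are recognized by their end-of-central-directory record.
    if zipfile.is_zipfile(payload):
        payload.seek(0)
        with zipfile.ZipFile(payload) as zf:
            zf.extractall(target)
        return "zip"

    # Otherwise try to read the payload as an uncompressed tar archive.
    payload.seek(0)
    try:
        with tarfile.open(fileobj=payload, mode="r:") as tf:
            tf.extractall(target)
        return "tar"
    except tarfile.ReadError:
        raise ValueError(f"{archive_path} holds neither a Zip nor a tar archive")
```

This would not help with the 7z case mentioned in the question's comments, since the standard library has no 7z reader, and `extractall` should only be used on archives from a trusted source.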