I'm trying to use Python's zipfile module to unzip a split ZIP archive by concatenating all the parts and then extracting the result, but I keep getting a "Bad magic number for file header" error.
I'm writing a Python script which will normally receive a single ZIP file, but will very rarely receive a ZIP file split into multiple parts (for example, foo.zip.001, foo.zip.002, etc.). From what I can tell, there's no easy way to deal with this if you need to bundle the script up with its dependencies for a Docker container. However, I stumbled across this SO answer, which explains that you can concatenate the parts into a single ZIP file and treat it as such. So my plan is to concatenate all the parts into one big ZIP file and then unzip that file. I created a test case (in a macOS terminal) from a video file with the following command:
$ zip -s 5m test ch4_3.mp4
Here's my code to concatenate all files together:
import zipfile

split_files = ['test.z01', 'test.z02', 'test.z03', 'test.zip']

# Append every part, in order, to one output file
with open('test_video.zip', 'wb') as f:
    for file in split_files:
        with open(file, 'rb') as zf:
            f.write(zf.read())
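The same concatenation can also be sketched as a streaming helper, so that a large part is never read into memory in one go. This is only a sketch: `concatenate_parts` and its `chunk_size` parameter are names made up for illustration, and the output should be byte-for-byte identical to what the loop above writes.

```python
import os
import shutil


def concatenate_parts(part_paths, dest_path, chunk_size=1024 * 1024):
    """Append each part to dest_path in the given order and return the total size."""
    with open(dest_path, "wb") as dest:
        for part in part_paths:
            with open(part, "rb") as src:
                # copyfileobj copies chunk_size bytes at a time
                # instead of loading the whole part into memory
                shutil.copyfileobj(src, dest, chunk_size)
    return os.path.getsize(dest_path)
```

Calling `concatenate_parts(['test.z01', 'test.z02', 'test.z03', 'test.zip'], 'test_video.zip')` would produce the same test_video.zip as before, so it does not change the error described below.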
If I go to my terminal and run unzip test_video.zip, this is the output:
$ unzip test_video.zip
Archive: test_video.zip
warning [test_video.zip]: zipfile claims to be last disk of a multi-part archive;
attempting to process anyway, assuming all parts have been concatenated
together in order. Expect "errors" and warnings...true multi-part support
doesn't exist yet (coming soon).
warning [test_video.zip]: 15728640 extra bytes at beginning or within zipfile
(attempting to process anyway)
file #1: bad zipfile offset (local header sig): 15728644
(attempting to re-compensate)
inflating: ch4_3.mp4
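The 15728640 in the warning above is exactly three times 5 MiB, which would match the three parts that come before test.zip in the concatenation order. A quick sketch of that arithmetic, assuming "-s 5m" means 5 × 1024 × 1024 bytes per part; `SPLIT_SIZE` and `bytes_before_last_part` are made-up names for illustration:

```python
import os

SPLIT_SIZE = 5 * 1024 * 1024  # assumption: "-s 5m" means 5 MiB per part


def bytes_before_last_part(part_paths):
    """Total size in bytes of every part except the last one in the list."""
    return sum(os.path.getsize(p) for p in part_paths[:-1])


# Three full 5 MiB parts precede test.zip in the concatenated file.
print(3 * SPLIT_SIZE)  # 15728640
```

If the assumption holds, `bytes_before_last_part(['test.z01', 'test.z02', 'test.z03', 'test.zip'])` should return the same number that unzip reports as "extra bytes".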
unzip prints several warnings along the way, but it does extract the file successfully. However, when I try to run the following code:
if not os.path.exists('output'):
    os.mkdir('output')
with zipfile.ZipFile('tester/test_video.zip', 'r') as z:
    z.extractall('output')
I get the following error:
---------------------------------------------------------------------------
BadZipFile Traceback (most recent call last)
<ipython-input-60-07a6f56ea685> in <module>()
2 os.mkdir('output')
3 with zipfile.ZipFile('tester/test_video.zip', 'r') as z:
----> 4 z.extractall('output')
~/anaconda3/lib/python3.6/zipfile.py in extractall(self, path, members, pwd)
1499
1500 for zipinfo in members:
-> 1501 self._extract_member(zipinfo, path, pwd)
1502
1503 @classmethod
~/anaconda3/lib/python3.6/zipfile.py in _extract_member(self, member, targetpath, pwd)
1552 return targetpath
1553
-> 1554 with self.open(member, pwd=pwd) as source, \
1555 open(targetpath, "wb") as target:
1556 shutil.copyfileobj(source, target)
~/anaconda3/lib/python3.6/zipfile.py in open(self, name, mode, pwd, force_zip64)
1371 fheader = struct.unpack(structFileHeader, fheader)
1372 if fheader[_FH_SIGNATURE] != stringFileHeader:
-> 1373 raise BadZipFile("Bad magic number for file header")
1374
1375 fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])
BadZipFile: Bad magic number for file header
If I try to run it with the .zip file before the others, this is what I get:
split_files = ['test.zip', 'test.z01', 'test.z02', 'test.z03']

with open('test_video.zip', 'wb') as f:
    for file in split_files:
        with open(file, 'rb') as zf:
            f.write(zf.read())

with zipfile.ZipFile('test_video.zip', 'r') as z:
    z.extractall('output')
Here's the output:
---------------------------------------------------------------------------
BadZipFile Traceback (most recent call last)
<ipython-input-14-f7aab706dbed> in <module>()
1 if not os.path.exists('output'):
2 os.mkdir('output')
----> 3 with zipfile.ZipFile('test_video.zip', 'r') as z:
4 z.extractall('output')
~/anaconda3/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
1106 try:
1107 if mode == 'r':
-> 1108 self._RealGetContents()
1109 elif mode in ('w', 'x'):
1110 # set the modified flag so central directory gets written
~/anaconda3/lib/python3.6/zipfile.py in _RealGetContents(self)
1173 raise BadZipFile("File is not a zip file")
1174 if not endrec:
-> 1175 raise BadZipFile("File is not a zip file")
1176 if self.debug > 1:
1177 print(endrec)
BadZipFile: File is not a zip file
Using the answer from this SO question, I've worked out that the header is b'PK\x07\x08', but I don't know why. I also used the testzip() function and it points straight to the culprit: ch4_3.mp4.
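For anyone who wants to reproduce the header check, here is a minimal sketch that reads four bytes at a given offset and compares them against the common ZIP signatures. `SIGNATURES` and `describe_signature` are names made up for illustration, not part of zipfile. As far as I can tell from the ZIP specification (APPNOTE), b'PK\x07\x08' is the data descriptor signature and is also written as a marker at the start of the first part of a split archive, whereas a local file header starts with b'PK\x03\x04'.

```python
# Four-byte signatures defined by the ZIP format (little-endian on disk).
SIGNATURES = {
    b"PK\x03\x04": "local file header",
    b"PK\x01\x02": "central directory file header",
    b"PK\x05\x06": "end of central directory record",
    b"PK\x07\x08": "data descriptor / split-archive marker",
}


def describe_signature(path, offset=0):
    """Return the four bytes at `offset` in `path` and a label for them."""
    with open(path, "rb") as f:
        f.seek(offset)
        magic = f.read(4)
    return magic, SIGNATURES.get(magic, "unknown")
```

Running `describe_signature('test.z01')` and `describe_signature('test.z01', offset=4)` would show whether the first part begins with the marker followed directly by a local file header.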
You can find the ZIP file in question at this link. Any ideas on what to do?