I need to concatenate some files from GitHub that were split into several pieces because of their size (as in this dataset: https://github.com/kang-gnak/eva-dataset).

I download them with requests, and they end up in my temporary data storage named File_Name.zip.001 through File_Name.zip.007.

The completed file contains images rather than text, so I haven't found a straightforward way to rebuild File_Name.zip from code.

Is anyone aware of a solution that would work directly in Colab?

I am looking for both repeatability and the ability to share my code as a Colab notebook, so I am trying to avoid solutions that involve having to download and rebuild the file locally and reuploading it each time. I would also prefer not to have to make an online copy of existing data if there's a way to rebuild and unzip the file directly from the code.

Thanks in advance.

I tried assigning a list of the parts' file names to data_zip_parts and running the following code:

with zipfile.ZipFile(data_path / "File_Name.zip", 'a') as full_zip:
    for file_name in data_zip_parts:
        part = zipfile.ZipFile(data_path / file_name, 'r')
        for name in part.namelist():
            full_zip.writestr(name, part.open(name).read())

However, it looks like this file format cannot be read directly, so I get the following error:

BadZipFile: File is not a zip file

Just a reminder that I want to do this directly within Google Colab. I have asked a few peers, but most of them suggested solutions that run on my local system, such as the command line or 7zip, which isn't quite what I'm looking for. I expect there is a way to work around this format and would appreciate the assistance.

pmqs
g6k

1 Answer

Understanding the Issue

I downloaded the dataset from https://github.com/kang-gnak/eva-dataset to see what you are dealing with:

$ ls -lh *
-rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.001
-rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.002
-rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.003
-rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.004
-rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.005
-rw-rw-r-- 1 paul paul  99M Oct  7 04:11 EVA_together.zip.006
-rw-rw-r-- 1 paul paul  70M Oct  7 04:11 EVA_together.zip.007

Let's see what the file command says about the content of these files

$ file *
EVA_together.zip.001: Zip archive data, at least v2.0 to extract, compression method=store
EVA_together.zip.002: data
EVA_together.zip.003: data
EVA_together.zip.004: data
EVA_together.zip.005: data
EVA_together.zip.006: data
EVA_together.zip.007: OpenPGP Public Key

As I expected, only the first file appears to be a zip file, and even that one has problems:

$ unzip -t EVA_together.zip.001
Archive:  EVA_together.zip.001
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of EVA_together.zip.001 or
        EVA_together.zip.001.zip, and cannot find EVA_together.zip.001.ZIP, period.

The Root Cause

The issue here is that the files EVA_together.zip.001 .. EVA_together.zip.007 are just a simple split of one large zip file.

Taken in isolation that means none of these files is a valid well-formed zip file. All are just fragments.
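To make that concrete, here is a small sketch that reproduces the situation in memory. The archive contents, the member names and the 2000-byte chunk size are made up for illustration; they are not taken from the EVA dataset. Opening the first piece alone fails with the same BadZipFile as in the question, while the rejoined bytes open and test cleanly.

```python
import io
import zipfile

# Build a small stored (uncompressed) zip in memory as a stand-in for the dataset.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    for i in range(5):
        archive.writestr(f"images/{i}.bin", bytes([i]) * 1000)
whole = buffer.getvalue()

# Split the raw bytes into fixed-size pieces, the way a file splitter would.
chunk_size = 2000  # arbitrary size chosen for this demo
fragments = [whole[i:i + chunk_size] for i in range(0, len(whole), chunk_size)]

# The first piece has no central directory, so zipfile refuses to open it.
try:
    zipfile.ZipFile(io.BytesIO(fragments[0]))
    first_fragment_error = None
except zipfile.BadZipFile as exc:
    first_fragment_error = str(exc)

# Joining the pieces in order restores the original archive.
rejoined = b"".join(fragments)
with zipfile.ZipFile(io.BytesIO(rejoined)) as archive:
    rejoined_names = archive.namelist()
    first_bad_member = archive.testzip()  # None means every member passed its CRC check

print("first fragment:", first_fragment_error)
print("rejoined members:", len(rejoined_names), "first bad member:", first_bad_member)
```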

The Fix

To recreate the composite zip file, you just need to concatenate the individual parts in order:

$ cat EVA_together.zip.00* >EVA_together.zip
$ ls -lh EVA_together.zip
-rw-rw-r-- 1 paul paul 664M Dec  6 09:31 EVA_together.zip
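If you would rather keep everything in Python, the same concatenation can be sketched with the standard library. The helper name join_parts and its parameters are my own invention, and I am assuming the parts always carry a zero-padded three-digit suffix, as they do in this dataset.

```python
import shutil
from pathlib import Path


def join_parts(directory, stem, destination):
    """Concatenate stem.001, stem.002, ... found in directory into destination."""
    directory = Path(directory)
    # Zero-padded suffixes sort correctly as plain strings (.001 < .002 < ...).
    parts = sorted(directory.glob(f"{stem}.[0-9][0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {stem}.NNN in {directory}")
    with open(destination, "wb") as joined:
        for part in parts:
            with open(part, "rb") as piece:
                # Stream in chunks so a ~100 MB part never sits fully in memory.
                shutil.copyfileobj(piece, joined)
    return parts
```

With the names from the question, a call might look like join_parts(data_path, "EVA_together.zip", data_path / "EVA_together.zip"), where data_path is the directory the parts were downloaded to.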

Check that we now have a valid zip file

$ file EVA_together.zip
EVA_together.zip: Zip archive data, at least v2.0 to extract, compression method=store

$ unzip -t EVA_together.zip
Archive:  EVA_together.zip
    testing: EVA_together/            OK
    testing: EVA_together/10021.jpg   OK
    testing: EVA_together/100397.jpg   OK
...
    testing: EVA_together/99711.jpg   OK
    testing: EVA_together/99725.jpg   OK
    testing: EVA_together/9993.jpg    OK
    testing: EVA_together/9999.jpg    OK
No errors detected in compressed data of EVA_together.zip.
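A rough Python counterpart to unzip -t, followed by extraction, could look like the sketch below. The function name verify_and_extract and its arguments are hypothetical; zipfile.ZipFile.testzip() reads every member and returns the name of the first one that fails its check, or None if all pass.

```python
import zipfile


def verify_and_extract(zip_path, target_dir):
    """Check every member of the archive, then extract it; returns the member count."""
    with zipfile.ZipFile(zip_path) as archive:
        bad_member = archive.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        archive.extractall(target_dir)
        return len(archive.namelist())
```

Running the check before extracting costs an extra pass over the data, so for a one-off notebook you may prefer to call extractall directly.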

I believe Colab allows shell escapes (a line prefixed with !), so you may not need to write the concatenation code in Python at all. It depends on your workflow.

pmqs