How to categorize files within a zip archive into a list in Python?

Question

I am trying to work with a zip archive in Kaggle and access the files inside train.zip so that I can train my model. The archive contains images of cats and dogs, and each filename reveals whether the image shows a cat or a dog. I think I can do this by reading the zip archive and then creating lists of the Cat and Dog images.

I know I can use this code to read the zip archive:

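import zipfile
from subprocess import check_output

# extract everything into the current working directory
with zipfile.ZipFile("../input/dogs-vs-cats/train.zip", "r") as z:
    z.extractall(".")

print(check_output(["ls", "train"]).decode("utf8"))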

Also, the code below can be used to categorize the files, provided that they have been unzipped. However, it seems the file is not unzipped and the code above has only read it. So I don't know how to combine these two snippets so that I can read the file names.

categories = []
for filename in filenames:
    category = filename.split('.')[0]
    if category == 'dog':
        categories.append(1)
    else:
        categories.append(0)

df = pd.DataFrame({
    'filename': filenames,
    'category': categories
})
print(categories)

The problem is that it seems filenames can only be strings, and I cannot assign the output of the first snippet (the one with the ZipFile call) to it. I think that by adding the following line I can read the directory and assign its contents to filenames; however, the file would have to be unzipped first.

filenames = os.listdir("../input/dogs-vs-cats/")

So, I wonder how I can feed the zip file to the categorization code, or how I can unzip the file in Kaggle so that the files can be found in the directory?

score 0 · Accepted Answer · answered Jun 26 '20 at 07:36

OK, no one answered my question, but I found a solution. Actually, I found where my problem was! I am posting it here so that others working with Kaggle zip files can use it.

My code (which is actually borrowed code :) was all correct. The only problem was that I was looking in the wrong directory! I used the os.listdir() function to understand how the files are structured in Kaggle and where the extracted files end up. (You can use the extractall() method to extract every file in a zip archive, or extract() to pull out a single member.)
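For anyone who wants to check where the files land, here is a rough sketch of that exploration step. The helper name extract_and_locate is made up for illustration, and the path in the commented example is the one from the question, so it assumes your dataset is laid out the same way.

```python
import os
import zipfile


def extract_and_locate(zip_path, dest="."):
    """Extract zip_path into dest; return (absolute dest, top-level names in the archive).

    Hypothetical helper for illustration, not part of any library.
    """
    with zipfile.ZipFile(zip_path, "r") as z:
        z.extractall(dest)
        # Names inside a zip always use "/" as the separator,
        # so the first component is the top-level folder or file.
        top_level = sorted({name.split("/")[0] for name in z.namelist()})
    return os.path.abspath(dest), top_level


# Example call (path taken from the question, assumed to match your dataset):
# dest, top_level = extract_and_locate("../input/dogs-vs-cats/train.zip")
# print(dest)              # directory the files were written to
# print(top_level)         # folder(s) the archive created there
# print(os.listdir(dest))  # confirm the folder really exists
```

Printing the absolute destination is what would have saved me: the extracted folder sits under the working directory, not next to the zip file in the input directory.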

So, if you have a zip archive in Kaggle and want to use it, just use the following code. You can use the whole thing, which categorizes files by name, or only the part where I explore the zipped archive. You DO NOT have to categorize the way I did; I was labeling files for use in a Convolutional Neural Network (CNN), and you may want some other kind of categorization.

# importing necessary libs and packages
from subprocess import check_output
import numpy as np
import pandas as pd 
import os
import zipfile

# opening and viewing the files
with zipfile.ZipFile("../input/dogs-vs-cats/train.zip","r") as z:
    z.extractall(".")
print(check_output(["ls", "train"]).decode("utf8"))

# categorizing files into categories of 0 and 1 to use as labels
filenames = os.listdir("../working/train")
categories = []
for filename in filenames:
    # this is where your code may differ from mine: you can categorize files differently
    category = filename.split('.')[0]
    if category == 'dog':
        categories.append(1)
    else:
        categories.append(0)

df = pd.DataFrame({
    'filename': filenames,
    'category': categories
})
print(categories)

Just remember: if the code does not work, the path is probably wrong. Use os.listdir() to find the correct path.
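If you only need the labels and not the images on disk yet, one way to sidestep the path issue is to read the names straight from the archive with ZipFile.namelist(), without extracting anything. This is a minimal sketch, not what I originally ran: the function name labels_from_zip is made up, and it assumes the filenames look like dog.123.jpg or cat.123.jpg as in this dataset.

```python
import zipfile


def labels_from_zip(zip_path):
    """Return (filenames, categories) read from the archive listing only.

    Hypothetical helper: 1 means the name starts with "dog", 0 means anything else.
    """
    with zipfile.ZipFile(zip_path, "r") as z:
        # Directory entries end with "/", so skip them.
        members = [name for name in z.namelist() if not name.endswith("/")]

    # Keep only the part after the last "/" (e.g. "train/dog.1.jpg" -> "dog.1.jpg").
    filenames = [name.rsplit("/", 1)[-1] for name in members]
    categories = [1 if name.split(".")[0] == "dog" else 0 for name in filenames]
    return filenames, categories


# Example call (path taken from the question, assumed to match your dataset):
# filenames, categories = labels_from_zip("../input/dogs-vs-cats/train.zip")
```

The two lists can go into the same pd.DataFrame as above. You would still need to extract the archive before actually loading the images for training.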
