How can I extract and process all the files in a zip archive?

import re
import zipfile
import pathlib
import pandas as pd


# Download mHealth dataset
def parse(zip_file):
    # Extract all the files in output directory
    with zipfile.ZipFile(zip_file, "r") as zfile:

        for file in zfile.extractall():
            if file.is_file():
                old_name = file.stem
                extension = file.suffix
                directory = file.parent

                new_name = re.sub("mHealth_", "", old_name) + extension
                file = file.rename(pathlib.Path(directory, new_name))
        zfile.close()
        return file

Traceback error:

Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\algorithms\project_kmeans.py", line 47, in <module>
    df_ = parse(zip_file_)
  File "C:\Users\User\PycharmProjects\algorithms\project_kmeans.py", line 12, in parse
    for file in zfile.extractall():
TypeError: 'NoneType' object is not iterable

Process finished with exit code 1
Dharman
melil
  • The error shows which line has the problem, so use `print()` to inspect the variables on that line. – furas Jul 19 '21 at 05:16
  • When you use `with ... as zfile`, you don't need `close()`, because the `with` block closes the file automatically. – furas Jul 19 '21 at 05:18
  • Did you check `extractall` in the documentation? – furas Jul 19 '21 at 05:20
  • `return file` returns only the last file. If you want all the file names, append them to a list and return that list. – furas Jul 19 '21 at 05:27

1 Answer

Use infolist() or namelist() instead of extractall() in the for loop.

extractall() extracts the files from the archive but returns None rather than a list of file names, so it can't be iterated over in a for loop. That is what causes the TypeError.

infolist() and namelist() do give you the names, but they introduce another problem: they return ZipInfo objects or strings, not Path objects, so the items don't have .is_file(), .stem, etc. You have to convert them to Path first.
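The difference in return values can be checked with a small in-memory archive. This is only a sketch: the member names are made up for illustration and are not the real mHealth file names.

```python
import io
import tempfile
import zipfile

# Build a tiny archive in memory (hypothetical member names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("mHealth_subject1.log", "0 1 2\n")
    zf.writestr("mHealth_subject2.log", "3 4 5\n")

with tempfile.TemporaryDirectory() as tmp_dir, zipfile.ZipFile(buffer) as zf:
    extract_result = zf.extractall(tmp_dir)  # writes the files, returns None
    names = zf.namelist()                    # list of str
    infos = zf.infolist()                    # list of zipfile.ZipInfo

print(extract_result)
print(names)
print([type(info).__name__ for info in infos])
```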

import zipfile
import pathlib
import pandas as pd

# Extract the mHealth dataset and strip the "mHealth_" prefix from the file names
def parse(zip_file):
    
    results = []
    
    # Extract all the files into the current working directory
    with zipfile.ZipFile(zip_file, "r") as zfile:

        zfile.extractall()  # extract
        
        #for filename in zfile.namelist():
        #    path = pathlib.Path(filename)

        for fileinfo in zfile.infolist():
            filename = fileinfo.filename
            path = pathlib.Path(filename)

            if path.is_file():
                old_name = path.stem
                extension = path.suffix
                directory = path.parent

                new_name = old_name.replace("mHealth_", "") + extension
                path = path.rename(pathlib.Path(directory, new_name))
                print('path:', path)
                results.append([filename, new_name])
                
    df = pd.DataFrame(results, columns=['old', 'new'])
    return df

df = parse('test.zip')
print(df)
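The code above relies on the archive being extracted into the current working directory, because path.is_file() resolves relative paths from there. If you want to extract somewhere else, one possible variant is to call extract() per member, since it returns the path it actually created. This is a sketch under assumptions: the function name extract_and_rename, the out_dir and prefix parameters, and the demo archive contents are all hypothetical.

```python
import pathlib
import tempfile
import zipfile


def extract_and_rename(zip_file, out_dir, prefix="mHealth_"):
    """Extract zip_file into out_dir and strip prefix from the file names.

    Returns a list of (name_in_archive, final_path) tuples.
    """
    renamed = []
    with zipfile.ZipFile(zip_file, "r") as zfile:
        for info in zfile.infolist():
            if info.is_dir():
                continue  # skip directory entries, only files are renamed
            # extract() returns the path of the file it created
            path = pathlib.Path(zfile.extract(info, out_dir))
            new_path = path.with_name(path.stem.replace(prefix, "") + path.suffix)
            if new_path != path:
                # Path.replace() overwrites an existing target file
                path.replace(new_path)
            renamed.append((info.filename, new_path))
    return renamed


# Demo with a throwaway archive (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    archive = pathlib.Path(tmp, "demo.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("data/mHealth_subject1.log", "0 1 2\n")
        zf.writestr("data/README.txt", "notes\n")

    result = extract_and_rename(archive, pathlib.Path(tmp, "out"))
    extracted_names = sorted(p.name for p in pathlib.Path(tmp, "out", "data").iterdir())

print(extracted_names)
```

Returning the list of pairs keeps every renamed file, not just the last one, and the list can be passed to pd.DataFrame as in the code above if a table is needed.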

Doc: infolist and extractall

furas