
I am trying to loop through my subdirectories and read in the zip files in each one. I am getting the error TypeError: 'WindowsPath' object is not iterable

What I am trying:

path = Path("O:/Stack/Over/Flow/")
for p in path.rglob("*"):
     print(p.name)
     zip_files = (str(x) for x in Path(p.name).glob("*.zip"))
     df = process_files(p)   #function

What does work - when I go to the folder directly with my path:

path = r'O:/Stack/Over/Flow/2022 - 10/'
zip_files = (str(x) for x in Path(path).glob("*.zip"))
df = process_files(zip_files)

Any help would be appreciated.

The directory structure looks like this:

 //Stack/Over/Flow/2022 - 10/Original.zip 
 //Stack/Over/Flow/2022 - 09/Next file.zip

The function I call:

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import os
import pandas as pd


def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        #data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)
        
        archive = ZipFile(file)

        # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
        files_in_archive = archive.namelist()
        excel_files_in_archive = [
            f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
        ]
        # ensure we only have one file (otherwise, loop or choose one somehow)
        assert len(excel_files_in_archive) == 1

        # read in data
        data_mapping = pd.read_excel(
            BytesIO(archive.read(excel_files_in_archive[0])),
            sheet_name=None,
        )

        row_counts = []
        for sheet in list(data_mapping.keys()):
            row_counts.append(len(data_mapping.get(sheet)))

        file_mapping.update({file: sum(row_counts)})

    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]

    return frame

New: what I am trying now

for root, dirs, files in os.walk(dir_path):
    for file in files:
        print(files)
        if file.endswith('.zip'):
            df = process_files(os.path.join(root, file))
            print(df) #function
        else:
            print("nyeh")

This prints file names such as Original - All fields - 11012021 - 11302021.zip, but then I get the error OSError: [Errno 22] Invalid argument: '\\'

Jonnyboi
  • Not enough information; always post the complete traceback. What does the directory structure look like? Why didn't you include the zip pattern in the rglob call? – wwii Dec 19 '22 at 15:12
  • You call `df = process_files(p)` instead of `df = process_files(zip_files)`... – Tomerikoo Dec 28 '22 at 17:35
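
To illustrate the two comments above: the first attempt passes the folder path `p` to `process_files`, which then tries to loop over a single `Path` object, and it also globs on `Path(p.name)`, which is only the folder's name and is resolved relative to the current working directory. A minimal sketch of building one list of zip paths per subfolder follows. The helper name `zip_files_by_folder` is hypothetical and not part of the original code.

```python
from pathlib import Path


def zip_files_by_folder(base):
    """Map each immediate subfolder name of `base` to the .zip paths inside it.

    Hypothetical helper for illustration only.
    """
    result = {}
    for folder in sorted(Path(base).iterdir()):
        if not folder.is_dir():
            continue
        # Glob on the full folder path; Path(folder.name) would be looked up
        # relative to the current working directory instead.
        result[folder.name] = sorted(str(z) for z in folder.glob("*.zip"))
    return result
```

Each value is a list of strings, which is the kind of argument `process_files` loops over, so a call such as `process_files(paths)` for each list in the mapping should fit the function's signature.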

1 Answer


A possible solution using os.walk():

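import os

# main_path is the top-level folder, e.g. the question's "O:/Stack/Over/Flow/"
zip_files = []
for root, dirs, files in os.walk(main_path):
    for file in files:
        if file.endswith('.zip'):
            # collect full paths so process_files() receives a list
            zip_files.append(os.path.join(root, file))
df = process_files(zip_files)   # call once, after the walk, with the whole list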
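
Since the question already uses pathlib, the same collection step could probably also be written with `rglob` and the zip pattern, as one of the comments on the question hints. This is only a sketch: `find_zip_files` is a hypothetical helper name, and the `is_file()` filter is an added assumption to skip any directory whose name happens to end in `.zip`.

```python
from pathlib import Path


def find_zip_files(base_dir):
    """Return every .zip file path under base_dir, including subfolders.

    Hypothetical helper for illustration only.
    """
    return sorted(
        str(p)
        for p in Path(base_dir).rglob("*.zip")
        if p.is_file()  # assumption: ignore directories named like "x.zip"
    )
```

The result is a plain list of strings, so it could be passed in one call, for example `df = process_files(find_zip_files(main_path))`.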
Prakash Dahal
  • Hi Prakash, just to clarify: do I add a line like `main_path = r'O:/Stack/Over/Flow'/`? I am getting `FileNotFoundError: [Errno 2] No such file or directory: 'O'` – Jonnyboi Dec 19 '22 at 14:37
  • Can you remove `r` and try again? – Prakash Dahal Dec 19 '22 at 18:43
  • Still getting that error :( – Jonnyboi Dec 19 '22 at 19:19
  • OK, I think it was because I was using a mapped drive; I'm using an absolute path now. Now I am getting a new error: `NameError: name 'df' is not defined` – Jonnyboi Dec 20 '22 at 02:05
  • This error happens if you try to use the `df` variable where it is not accessible. – Prakash Dahal Dec 20 '22 at 05:29
  • Any idea how this can be fixed, Prakash? – Jonnyboi Dec 20 '22 at 15:30
  • This is a pretty simple error: you are trying to use `df` where it is not reachable. You can include the traceback in the question. – Prakash Dahal Dec 20 '22 at 16:40
  • Okay, so I am using `print(files)` and it prints some of them, then I get an error `OSError: [Errno 22] Invalid argument: '\\'`. Any advice on what to do? I will update my code. – Jonnyboi Dec 26 '22 at 20:29
  • In `process_files()` you are expecting a list of zip files, but when calling it you are passing the path of a single file, which might be the cause of the error. I have updated the code to pass a list instead of a path str. – Prakash Dahal Dec 28 '22 at 17:28
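
The last comment can be made concrete with a small demonstration. Looping over a single string yields one character at a time, so `for file in files` inside `process_files` would hand `ZipFile` a lone `'\\'` when the path starts with a backslash, which likely explains the `Invalid argument: '\\'` message. The UNC path below is a made-up example, not the asker's real path.

```python
# Hypothetical UNC path, used only to show how iteration behaves.
single_path = r"\\server\share\Original.zip"

# Iterating over a str yields its characters, not the path as a whole.
first_items = [item for item in single_path][:3]
print(first_items)  # ['\\', '\\', 's']

# Wrapping the path in a list makes the loop see the full path as one item.
wrapped_items = [item for item in [single_path]]
print(wrapped_items == [single_path])  # True
```

So for a single file, a call like `process_files([os.path.join(root, file)])` passes a one-element list, while collecting all paths first and calling the function once, as in the answer, avoids the issue entirely.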