How can I loop over 2 folders? In Apple and all its subfolders, I want to look for Excel files that contain "green". In Banana, I want to look for files that contain "yellow". I explicitly need to specify the folder paths and can't just loop over the whole C drive.

import os
import pandas as pd

folders = ['C:/Desktop/apple', 'C:/Downloads/banana']
for i in range(len(folders)):
    for root, dirs, files in os.walk(folders[i]):
        for file in files:
            if file.endswith(".xlsx") and "banana" in folders[i] and "yellow" in file:
                df = pd.read_excel(os.path.join(root, file))
                df['date'] = pd.to_datetime(df.date)
                ...

            if file.endswith(".xlsx") and "apple" in folders[i] and "green" in file:
                df = pd.read_excel(os.path.join(root, file))
                df['date'] = pd.to_datetime(df.date)
                ...

Since all the Excel files have the same layout, the code above is cumbersome: I'm duplicating the code that reads and cleans the DataFrame.

asd

2 Answers

The easiest way to get all the file paths that match your conditions is to use the glob module:

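import glob
import pandas as pd

# '**' with recursive=True also searches the subfolders of apple
files = (glob.glob('C:/Desktop/apple/**/*green*.xlsx', recursive=True)
         + glob.glob('C:/Downloads/banana/*yellow*.xlsx'))
for file in files:
    print(file)
    df = pd.read_excel(file)  # glob already returns the full path
    df['date'] = pd.to_datetime(df['date'])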

glob uses Unix shell-style wildcards, not regular expressions. If you want to choose only files whose names start with "green", remove the first asterisk, like so: green*.
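
A small sketch of how these patterns behave, using the standard library's fnmatch module, which implements the shell-style wildcard matching that glob relies on. The file names below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only
names = ["green_apples.xlsx", "dark_green.xlsx", "yellow.xlsx"]

# "*green*" matches "green" anywhere in the name
anywhere = [n for n in names if fnmatchcase(n, "*green*.xlsx")]

# "green*" matches only names that start with "green"
prefix = [n for n in names if fnmatchcase(n, "green*.xlsx")]

# "[gy]*" is a character set: names starting with "g" or "y"
charset = [n for n in names if fnmatchcase(n, "[gy]*.xlsx")]

print(anywhere, prefix, charset)
```

Note that there is no "or" operator for whole words in these patterns, so matching "green" or "red" needs two patterns (or a filter written in Python).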

To do this using pathlib:

from pathlib import Path
import pandas as pd

# rglob searches apple and all its subfolders; glob searches banana only
files = (list(Path('C:/Desktop/apple').rglob('*green*.xlsx'))
         + list(Path('C:/Downloads/banana').glob('*yellow*.xlsx')))
for file in files:
    df = pd.read_excel(file)
    df['date'] = pd.to_datetime(df['date'])
Aditya
  • Many thanks. One follow-up, if possible: if I wanted to search the Apple folder for "green" or "red", could I add an "or" condition to this line: `glob.glob('C:/Desktop/apple/*green*.xlsx')`? Or would I have to add another + term? – asd Apr 25 '21 at 17:04

You can create a dictionary that maps each folder to the string to search for in its file names. Pseudocode:

import os
import pandas as pd

to_search = {                             # <--- the dictionary
    "C:/Desktop/apple": "green",
    "C:/Downloads/banana": "yellow",
}

for folder, item in to_search.items():    # <--- use dict.items()
    for root, dirs, files in os.walk(folder):  # <--- here you use "folder"
        for file in files:
            if file.endswith(".xlsx") and item in file:   # <--- here you use "item"
                df = pd.read_excel(os.path.join(root, file))
                df["date"] = pd.to_datetime(df["date"])

                # ...
Andrej Kesely
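
Building on the dictionary idea, here is a minimal sketch that also covers the follow-up about matching either "green" or "red" in the same folder. It assumes each folder maps to a tuple of keywords; the helper name `find_excel_files` is hypothetical, not part of any library. Keeping the file search in one place means the read-and-clean code only has to be written once.

```python
import os


def find_excel_files(to_search):
    """Yield paths of .xlsx files whose name contains any keyword for its folder.

    `to_search` maps a folder path to a tuple of keywords (assumed layout).
    """
    for folder, keywords in to_search.items():
        # os.walk descends into every subfolder of `folder`
        for root, _dirs, files in os.walk(folder):
            for name in files:
                if name.endswith(".xlsx") and any(k in name for k in keywords):
                    yield os.path.join(root, name)


# Usage sketch (paths taken from the question; requires pandas):
# to_search = {
#     "C:/Desktop/apple": ("green", "red"),
#     "C:/Downloads/banana": ("yellow",),
# }
# for path in find_excel_files(to_search):
#     df = pd.read_excel(path)
#     df["date"] = pd.to_datetime(df["date"])
```

Like the os.walk version above, this searches the subfolders of every listed folder, not only those of apple.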