2

What is the most efficient way to get the paths of subfolders that contain files? For example, suppose this is my input structure:

inputFolder    
│
└───subFolder1
│   │
│   └───subfolder11
│       │   file1.jpg
│       │   file2.jpg
│       │   ...
│   
└───folder2
    │   file021.jpg
    │   file022.jpg

If I call getFolders(inputPath), it should return a list of the folders containing images: ['inputFolder/subFolder1/subfolder11', 'inputFolder/folder2']

Currently I'm using my library TreeHandler, which is just a wrapper around os.walk, to get all the files.

import os
from treeHandler import treeHandler
th=treeHandler()
tempImageList=th.getFiles(path,['jpg'])
# tempImageList is a list of paths of all files with the '.jpg' extension

# The filtering step below is the line that needs optimising.
subFolderList=list(set(list(map(lambda x:os.path.join(*x.split('/')[:-1]),tempImageList))))

I think it can be done more efficiently.

Thanks in advance

Sreekiran A R

3 Answers

1
import os
import glob

original_path = './inputFolder/'

def get_subfolders(path):
    # Immediate subdirectories of `path`
    return [f.path for f in os.scandir(path) if f.is_dir()]

def get_files_in_subfolder(subfolder, extension):
    # Files directly inside `subfolder` whose names end with `extension`
    return glob.glob(os.path.join(subfolder, '*' + extension))

files = []
# Start from the root only: its subfolders are pushed when it is popped,
# so seeding them here as well would visit every folder twice.
subfolders = [original_path]
while subfolders:
    new_subfolder = subfolders.pop()
    print(new_subfolder)
    subfolders += get_subfolders(new_subfolder)
    files += get_files_in_subfolder(new_subfolder, '.jpg')
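
If the goal is the list of folders rather than the files themselves, a similar traversal can be written directly with os.walk. The following is only a minimal sketch and not part of the answer above; the function name `get_folders_with_files` and its `extensions` parameter are made up for illustration.

```python
import os

def get_folders_with_files(root, extensions=('.jpg',)):
    """Return every folder under `root` that directly contains a matching file."""
    # Hypothetical helper: compare extensions case-insensitively.
    extensions = tuple(ext.lower() for ext in extensions)
    folders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Keep the folder as soon as one of its own files matches.
        if any(name.lower().endswith(extensions) for name in filenames):
            folders.append(dirpath)
    return folders
```

Because each folder is tested once while walking, no intermediate list of all file paths is built and no de-duplication with set() is needed.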
pierresegonne
1
  • Splitting all the parts of a path and re-joining them seems to reduce efficiency.
  • Finding the index of the last instance of '/' and slicing works much faster.

    def remove_tail(path):
        index = path.rfind('/') # returns index of last appearance of '/' or -1 if not present
        return (path[:index] if index != -1  else '.') # return . for parent directory
    .
    .
    .
    subFolderList = list(set([remove_tail(path) for path in tempImageList]))
    
  • Verified on the AWA2 dataset folders (50 folders and 37,322 images).

  • Observed a result about 3 times faster.
  • Improved readability by using a list comprehension.
  • Handled the case where the parent directory itself contains images (which raises an error with the existing implementation).
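
For comparison, the standard library's os.path.dirname performs the same "drop the last component" step. The sketch below uses made-up sample paths and a copy of `remove_tail`; it also shows why the original split/join line fails for a file with no folder part. One difference to be aware of: dirname returns an empty string for a bare filename, so `or '.'` is added here to match `remove_tail`.

```python
import os

def remove_tail(path):
    index = path.rfind('/')
    return path[:index] if index != -1 else '.'

# Made-up sample paths, including one file that sits in the current directory.
paths = ['inputFolder/folder2/file021.jpg',
         'inputFolder/folder2/file022.jpg',
         'file0.jpg']

by_slice = {remove_tail(p) for p in paths}
by_dirname = {os.path.dirname(p) or '.' for p in paths}

# The original approach calls os.path.join() with no arguments for 'file0.jpg'.
try:
    os.path.join(*'file0.jpg'.split('/')[:-1])
    original_fails = False
except TypeError:
    original_fails = True
```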

Here is the code used for verification:

import os
from treeHandler import treeHandler
import time

def remove_tail(path):
    index = path.rfind('/')
    return (path[:index] if index != -1  else '.')

th=treeHandler()
tempImageList= th.getFiles('JPEGImages',['jpg'])
# tempImageList is a list of paths of all files with the '.jpg' extension

# The filtering step below is the line being optimised.
print(len(tempImageList))
start = time.time()
originalSubFolderList=list(set(list(map(lambda x:os.path.join(*x.split('/')[:-1]),tempImageList))))
print("Current method takes", time.time() - start)

start = time.time()
newSubFolderList = list(set([remove_tail(path) for path in tempImageList]))
print("New method takes", time.time() - start)

# Compare as sets: the element order of list(set(...)) is not guaranteed.
print("Outputs match:", set(originalSubFolderList) == set(newSubFolderList))
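
For a comparison that does not depend on treeHandler or on a particular dataset, the same check can be sketched with timeit. The paths below are synthetic stand-ins (an assumption, not the AWA2 data), and the equality check assumes '/' is the platform's path separator; absolute timings will vary by machine.

```python
import os
import timeit

def remove_tail(path):
    index = path.rfind('/')
    return path[:index] if index != -1 else '.'

# Synthetic stand-in for tempImageList: 50 folders x 200 files (made-up names).
paths = [f'JPEGImages/class{d}/img{i}.jpg' for d in range(50) for i in range(200)]

def split_join():
    # Approach from the question: split every component, then re-join.
    return set(os.path.join(*p.split('/')[:-1]) for p in paths)

def rfind_slice():
    # Approach from this answer: slice at the last '/'.
    return set(remove_tail(p) for p in paths)

print('same folders:', split_join() == rfind_slice())
print('split/join  :', timeit.timeit(split_join, number=20))
print('rfind/slice :', timeit.timeit(rfind_slice, number=20))
```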
Sreeragh A R
0
import os

# Note: only the immediate subfolders of input_folder are checked.
subfolders = [f.path for f in os.scandir(input_folder)
              if f.is_dir() and any(fname.endswith(('.png', '.jpeg', '.jpg')) for fname in os.listdir(f.path))]
Manuel