
I need to generate a list of files with paths that contain a certain string by recursively searching. I'm doing this currently like this:

from glob import iglob

filelist = []
for i in iglob(starting_directory + '/**/*', recursive=True):
    if filemask in i.split('\\')[-1]:  # match against the filename only, not directory names
        filelist.append(i)
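As a side note, splitting on `'\\'` ties the filter to Windows path separators; `os.path.basename` does the same job portably. A sketch of the question's search wrapped in a function (the function name is illustrative, not from the original code):

```python
import os
from glob import iglob

def find_files(starting_directory, filemask):
    """Recursively collect paths whose *filename* contains filemask.

    os.path.basename handles both '\\' and '/' separators, so this
    behaves the same on Windows and POSIX. This is the question's
    approach restated, not a speed fix.
    """
    return [p for p in iglob(os.path.join(starting_directory, '**', '*'),
                             recursive=True)
            if filemask in os.path.basename(p)]
```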

This works, but when crawling a large directory tree, it's woefully slow (~10 minutes). We're on Windows, so doing an external call to the unix find command isn't an option. My understanding is that glob is faster than os.walk.

Is there a faster way of doing this?

Noise in the street
    What version of python is this? `glob.iglob` and `os.walk` were both updated to be 2-20 times faster following [PEP 471](https://www.python.org/dev/peps/pep-0471/). – FHTMitchell Jun 20 '18 at 12:50
  • I'm running python 3.6. For everyone's SA, the PEP you refer to was from 2014. – Noise in the street Jun 20 '18 at 13:31
  • Yeah, but lots of people still use python 2.7 which doesn't have these advantages. I ran a test case with `os.walk` and `glob.iglob` and `walk` was 20 % faster than `iglob` over directory structures that take ~5 seconds to iterate over. I don't think you're going to get much faster with python. Perhaps try cygwin's `find`. – FHTMitchell Jun 20 '18 at 14:34

2 Answers


Maybe not the answer you were hoping for, but I think these timings are useful. Run on a directory with 15,424 directories totalling 102,799 files (of which 3059 are .py files).

Python 3.6:

import os
import glob

def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                pys.append(file)
    return pys

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys

# I also tried pathlib.Path.glob but it was slow and error prone, sadly

%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using GNU find (4.6.0) on cygwin (4.6.0-1)

Edit: the timing below is on Windows; on Linux I found `find` to be about 25% faster.

$ time find . -name '*.py' > /dev/null

real    0m8.827s
user    0m1.482s
sys     0m7.284s

It seems like os.walk is as good as you can get on Windows.

FHTMitchell

os.walk() is built on os.scandir(), which is the fastest option, and scandir() yields DirEntry objects that can be reused for other purposes as well; below I use one to get the modified time without an extra stat() call. The following code implements a recursive search using os.scandir():

import os
import time

def scantree(path):
    """Recursively yield DirEntry objects for the given directory."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

for entry in scantree('/home/'):
    if entry.is_file():
        print(entry.path, time.ctime(entry.stat().st_mtime))
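For the original question's use case, the same generator can feed a filemask filter; a sketch (the wrapper function name is illustrative, not from this answer):

```python
import os

def scantree(path):
    """Recursively yield DirEntry objects for the given directory."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

def search_tree(starting_directory, filemask):
    """Collect file paths whose name contains filemask, as in the question."""
    return [entry.path for entry in scantree(starting_directory)
            if entry.is_file() and filemask in entry.name]
```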
Captain Hat
Sreenath D