
I want to find all files in a directory tree with a given file extension. However, some folders are really large, so I want to stop the search if it takes too long (say, 1 second). My current code looks something like this:

import os
import time

start_time = time.time()
file_ext = '.txt'
path = 'C:/'
file_list = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(file_ext):
            relDir = os.path.relpath(root, path)
            relFile = os.path.join(relDir, file)
            file_list.append(relFile)
        if time.time() - start_time > 1:
            break
    if time.time() - start_time > 1:
        break

The problem with this code is that when it reaches a really large subfolder, it does not break until that folder has been completely traversed. If that folder contains many files, this can take much longer than I would like. Is there any way I can make sure that the code does not run for much longer than the allotted time?

Edit: Note that while it is certainly helpful to find ways to speed up the code (for instance by using os.scandir), this question deals primarily with how to kill a process that is running.
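For reference, here is a rough, untested sketch of the kind of os.scandir-based walk I mean, checking the clock between individual directory entries rather than once per folder (the find_files name and the time_limit parameter are just made up for illustration):

import os
import time

def find_files(path, file_ext, time_limit=1.0):
    # yield matching files, checking the clock between individual entries
    start_time = time.time()
    stack = [path]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as it:
                for entry in it:
                    if time.time() - start_time > time_limit:
                        return
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.name.endswith(file_ext):
                        yield os.path.relpath(entry.path, path)
        except OSError:
            continue  # skip folders we are not allowed to read

file_list = list(find_files('C:/', '.txt'))

This interleaves the time check with the listing, but it still does not kill anything, which is what this question is really about.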

matnor
  • put it in a function and use return ? – Joran Beasley Oct 20 '16 at 20:59
  • Unfortunately, that produces the same result. – matnor Oct 20 '16 at 21:00
  • your indentation is likely wrong then ...try copy pasting your question back into your editor and see if it works – Joran Beasley Oct 20 '16 at 21:01
  • just tested the code you posted, and it always exits almost immediately after 1 second (1.00009) ... – Joran Beasley Oct 20 '16 at 21:04
  • @JoranBeasley I think the problem is that os.walk won't give you the file list until the directory has been read completely, so it does no good to check the time while enumerating the files. – tdelaney Oct 20 '16 at 21:05
  • Yeah, that's correct. Unless there is a really large folder, this script will exit after almost exactly 1 second. But when there is a folder containing several hundred thousand files, it no longer works. – matnor Oct 20 '16 at 21:06
  • ahh good point now I see what you mean .... sorry – Joran Beasley Oct 20 '16 at 21:06
  • Possible duplicate of [Is there a way to efficiently yield every file in a directory containing millions of files?](http://stackoverflow.com/questions/5090418/is-there-a-way-to-efficiently-yield-every-file-in-a-directory-containing-million) – zvone Oct 20 '16 at 22:30
  • I think my question is a bit different in that even with a more efficient way to walk through the directories, I still want to be able to kill the process. – matnor Oct 21 '16 at 14:23

1 Answer


You can do the walk in a subprocess and kill that. Options include multiprocessing.Process, but on Windows the multiprocessing machinery has to do a fair amount of setup work that you don't need here. Instead, you can just pipe the walker code into a python subprocess and go from there.

import os
import sys
import threading
import subprocess as subp

walker_script = """
import os
import sys
path = os.environ['TESTPATH']
file_ext = os.environ['TESTFILEEXT']

# let parent know we are going
print('started')

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(file_ext):
            relDir = os.path.relpath(root, path)
            relFile = os.path.join(relDir, file)
            print(relFile)
"""

file_ext = '.txt'
path = 'C:/'

encoding = sys.getdefaultencoding()

# the subprocess reads the directories... the extra python flags speed up
# python initialization. On a linuxy system, forking would be a good option.
# -u keeps the child's stdout unbuffered so the start marker and the results
# reach the parent immediately instead of sitting in the pipe buffer.

env = dict(os.environ)            # clone the parent's environment
env['TESTPATH'] = path
env['TESTFILEEXT'] = file_ext
proc = subp.Popen([sys.executable, '-u', '-E', '-s', '-S', '-'],
    stdin=subp.PIPE, stdout=subp.PIPE,   # add stderr=subp.DEVNULL to silence errors
    env=env)

# write walker script
proc.stdin.write(walker_script.encode('utf-8'))
proc.stdin.close()

# wait for start marker
next(proc.stdout)

# timer kills directory traversal when bored
threading.Timer(1, proc.kill).start()

file_list = [line.decode(encoding).strip() for line in proc.stdout]
print(file_list)
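
For completeness, this is roughly what the multiprocessing.Process option mentioned at the top might look like. It is a minimal, untested sketch: the walker function and the hard-coded path and extension are just placeholders, and on Windows the worker has to be a module-level function with the main code under an `if __name__ == '__main__':` guard.

import multiprocessing as mp
import os

def walker(path, file_ext, queue):
    # runs in the child process; push matches back through the queue
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(file_ext):
                queue.put(os.path.relpath(os.path.join(root, file), path))

if __name__ == '__main__':
    queue = mp.Queue()
    proc = mp.Process(target=walker, args=('C:/', '.txt', queue))
    proc.start()
    proc.join(1)              # give the walk at most 1 second
    if proc.is_alive():
        proc.terminate()      # kill the walk if it is still running
        proc.join()
    file_list = []
    while not queue.empty():
        file_list.append(queue.get())
    print(file_list)

One caveat with this variant: terminating a child while it is writing to a Queue can leave the queue in a corrupted state, which is part of why piping the results through a plain subprocess, as above, tends to be more robust.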
tdelaney
  • Using the above code gives me the error `Fatal Python error: Failed to initialize Windows random API (CryptoGen)`. I'm not very familiar with using subprocess, so I'm not really sure how to interpret the error. Thanks for the suggestion, I'll have to dig a bit deeper into how to use subprocesses. – matnor Oct 21 '16 at 14:19
  • @matnor - I was a bit over-aggressive in trimming the child's process environment. I updated the code to clone the environment of the parent. If you still have a problem, you can remove the `-E -s -S` flags, which are only there to speed python load time. – tdelaney Oct 21 '16 at 17:43
  • Thanks, the code runs now. Unfortunately, the script isn't killed after one second, but keeps running. – matnor Oct 21 '16 at 18:17