7

I have a huge set of files that I want to traverse using Python. I am using os.walk(source), and it works, but because there are so many files it takes too much time and memory, since it appears to be getting the complete list all at once. How can I optimize this to use fewer resources, perhaps by walking one directory at a time or in some other efficient manner, while still iterating over the complete set of files? Thanks.

for dir, dirnames, filenames in os.walk(START_FOLDER): 
    for name in dirnames: 
        #if PRIVATE_FOLDER not in name: 
            for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST: 
                if keyword in name.lower(): 
                    ignoreList.append(name)
GVH
nirvana
  • `os.walk` already returns a generator, which is lazy. Are you turning it into a list or something? Because if not, it should not cause memory issues. (Also, post your code.) – senshin Feb 12 '14 at 03:51
  • I want to go through each file name and, if it contains certain keywords, add it to a list: `for dir, dirnames, filenames in os.walk(START_FOLDER): for name in dirnames: #if PRIVATE_FOLDER not in name: for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST: if keyword in name.lower(): ignoreList.append(name)` – nirvana Feb 12 '14 at 04:25
  • Okay. Post your code that does that. – senshin Feb 12 '14 at 04:25
  • @senshin for dir, dirnames, filenames in os.walk(START_FOLDER): for name in dirnames: #if PRIVATE_FOLDER not in name: for keyword in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST: if keyword in name.lower(): ignoreList.append(name) – nirvana Feb 12 '14 at 04:50
  • Comments can't contain formatted code, you'd better edit your post and insert the snippet there. – user3159253 Feb 12 '14 at 05:13
  • How long does it take just to do the listing itself? As in, if you were to just run `for dir, dirnames, filenames in os.walk(START_FOLDER): pass`, is it still unacceptably slow/memory intensive? – GVH Feb 12 '14 at 06:07
  • What is `len(FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST)`? You can hoist the `name.lower()` out of the innermost loop, which can help if the keywords list is very large. – Cuadue Feb 12 '14 at 17:59
  • See [this link](http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/) for a potential speed-up using C. I've run into problems doing lists on 100s of millions of files, and solved the problem using the method described in the link. – GVH Feb 12 '14 at 20:20
  • How do you define huge? A few hundred? Thousand? One hundred million? – Bryan Oakley Feb 13 '14 at 00:15
  • @senshin: The bottleneck can be at the OS/Python interface. Even in 3.4 all the directory entries are read in at once, which can make reading large directories slow or impossible. See [Issue 11406](http://bugs.python.org/issue11406) for details. – Ethan Furman Feb 14 '14 at 02:13

2 Answers

3

If the issue is that the directory simply has too many files in it, this will hopefully be solved in Python 3.5.

Until then, you may want to check out scandir.
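As a rough sketch of what a scandir-based traversal could look like (this assumes Python 3.6 or later, where `os.scandir()` can be used as a context manager; the helper name `iter_files` is hypothetical and not from the original post), the generator below reads one directory at a time and yields file paths as it goes. It still keeps a stack of pending subdirectory paths in memory, so it is not a complete fix for extremely deep or wide trees.

```python
import os


def iter_files(root):
    """Yield paths of regular files under root, one directory at a time.

    Hypothetical helper for illustration; assumes Python 3.6+.
    """
    stack = [root]  # directories that still need to be scanned
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        # Defer the subdirectory instead of recursing.
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        yield entry.path
        except OSError:
            # Unreadable or vanished directory: skip it and carry on.
            continue
```

Typical use would be `for path in iter_files(START_FOLDER): ...`, which processes each file as it is discovered rather than collecting everything first.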

Ethan Furman
  • Yes, [os.scandir()](https://docs.python.org/3.5/library/os.html#os.scandir) was added to Python 3.5 and returns a generator that yields simple [os.DirEntry](https://docs.python.org/3.5/library/os.html#os.DirEntry) objects containing file path and other file attributes. – David Jul 10 '16 at 18:41
2

Use the `in` operator together with `any()` to test whether any keyword is a substring of a directory name.

import os

ignoreList = []
for _, dirnames, _ in os.walk(START_FOLDER):
    for name in dirnames:
        lowered = name.lower()  # lowercase once per name, not once per keyword
        if any(k in lowered for k in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST):
            ignoreList.append(name)

If your ignoreList is too big, you may want to think about creating an acceptedList and using that instead.
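If the list itself is the memory concern, one option is to not build it at all and yield matches lazily instead. The following is only a sketch: the function name `matching_dirs` and its parameters are hypothetical, and it lowercases the keywords up front, which assumes a case-insensitive match is what is wanted.

```python
import os


def matching_dirs(start_folder, keywords):
    """Lazily yield directory names that contain any keyword.

    Hypothetical helper; matching is case-insensitive by assumption.
    """
    lowered_keywords = [k.lower() for k in keywords]
    for _, dirnames, _ in os.walk(start_folder):
        for name in dirnames:
            lowered = name.lower()  # computed once per directory name
            # any() stops at the first keyword that matches.
            if any(k in lowered for k in lowered_keywords):
                yield name
```

A caller can then loop over `matching_dirs(START_FOLDER, FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST)` and handle each name immediately. This reduces what the script stores, but it does not change how `os.walk` reads each directory.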

Levi
  • Now it requires python 3, which should be mentioned since OP didn't tag either 2.x or 3.x. – GVH Feb 12 '14 at 20:18
  • @GVH what isn't p3k compatible? – Levi Feb 12 '14 at 23:13
  • You should never have to use `True` and `False` in a ternary `if` expression. `True if x in y else False` is the same as simply `x in y`. Second, you've still mixed up the OP's test. OP checks whether any keyword is a substring of the name; not whether the name is a substring of any keyword. – John Y Feb 12 '14 at 23:24
  • And yet you still didn't correct the direction of the substring test. – John Y Feb 13 '14 at 00:04
  • This will save a little time, but won't help at all if the OP has too many directory entries (and the OP could save the same time by adding a `break` after the `append`). – Ethan Furman Feb 14 '14 at 02:28