
I'm trying to create a utility class for traversing all the files in a directory, including those within subdirectories and sub-subdirectories. I tried to use a generator because generators are cool; however, I hit a snag.


import os

def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            yield grab_files(full_path)
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)

When the generator reaches a directory, it simply yields the memory location of the new generator; it doesn't give me the contents of the directory.
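Roughly, this is what I see when the first entry happens to be a subdirectory (the address is just a placeholder):

>>> g = grab_files('some_directory')
>>> next(g)
<generator object grab_files at 0x...>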

How can I make the generator yield the contents of the directory instead of a new generator?

If there's already a simple library function to recursively list all the files in a directory structure, tell me about it. I don't intend to replicate a library function.

Evan Kroske

7 Answers


Why reinvent the wheel when you can use os.walk?

import os
for root, dirs, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))

os.walk() is a generator that walks the directory tree either top-down or bottom-up, yielding a (root, dirs, files) tuple for each directory it visits.
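If you specifically want a generator of individual file paths, like the question's grab_files, a thin wrapper over os.walk does it (a minimal sketch, reusing the question's function name):

import os

def grab_files(directory):
    # walk top-down and yield each file's full path, one at a time
    for root, dirs, files in os.walk(directory):
        for name in files:
            yield os.path.join(root, name)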

Nadia Alramli
  • But then again, by reinventing the wheel we could `os.cycle` rather than `os.walk`... – mjv Nov 09 '09 at 01:15
  • I think it's a joke... "reinventing the wheel"? Walking vs. cycling? Pretty good.. :) – Ned Batchelder Nov 09 '09 at 01:18
  • Yes, Ned, a joke. The suggestion to os.walk() is the way-to-go, unless one is merely trying to learn about generators and uses directory traversal as a practical exercise for it. – mjv Nov 09 '09 at 01:36
  • @Ned: I literally just facepalmed. – Jed Smith Nov 09 '09 at 02:10
  • os.walk might be a generator, but its granularity is a directory level and the files it returns is a list. If you have a directory with millions of files in it, good luck using os.walk. At least this is true in 2.7. – woot May 17 '15 at 15:05
  • In addition to what woot pointed out, `os.walk` also sorts symbolic links into either the directory or file list based on the files they point to. This is fine much of the time, but not if you are trying to operate on the links instead of the linked-to files. – Pi Marillion Jan 23 '16 at 01:51
  • @woot - that's exactly why I'm here - I'm trying to split my million files into subdirectories (git object style), but os.listing is taking forever... – iAdjunct Dec 15 '17 at 01:17
  • @iAdjunct Look at using scandir. https://github.com/benhoyt/scandir for python 2.x, else it is built in in python 3.x I think. – woot Dec 15 '17 at 06:19
  • Yes, scandir is built into 3 and returns an iterator. listdir still returns a list. – philologon Feb 11 '19 at 22:39
  • `walk()` internally uses `listdir()` so that loses the advantages of using a generator like `walk()` in the first place. Just use `scandir()`. Source: https://hg.python.org/cpython/file/29f0836c0456/Lib/os.py#l276 – Flair Nov 03 '20 at 19:08
  • To be clear, what I said above only applies to Python 2 and versions of Python 3 before 3.5. The maintained versions of Python no longer have that issue with `walk()`: https://github.com/python/cpython/blob/64fc105b2d2faaeadd1026d2417b83915af6622f/Lib/os.py#L357 – Flair Jan 28 '21 at 00:18

As of Python 3.4, you can use the glob() method of Path objects from the built-in pathlib module:

import pathlib
p = pathlib.Path('.')
list(p.glob('**/*'))    # lists all files and directories recursively
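Since '**/*' matches directories as well as files, filter on is_file() if you only want files (a sketch; Path.rglob('*') is an equivalent shorthand for glob('**/*')):

import pathlib

p = pathlib.Path('.')
files = [f for f in p.glob('**/*') if f.is_file()]   # recursive, files only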
EinfachToll

I agree with the os.walk solution.

For purely pedantic purposes, iterate over the generator object instead of yielding it directly:


import os

def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            for entry in grab_files(full_path):
                yield entry
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)
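On Python 3.3+, the inner loop can be collapsed with yield from; this is the same logic, just more compact:

import os

def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            yield from grab_files(full_path)   # delegate to the sub-generator
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)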
thebat

Starting with Python 3.4, you can use the pathlib module:

In [48]: def alliter(p):
   ....:     yield p
   ....:     for sub in p.iterdir():
   ....:         if sub.is_dir():
   ....:             yield from alliter(sub)
   ....:         else:
   ....:             yield sub
   ....:             

In [49]: g = alliter(pathlib.Path("."))

In [50]: [next(g) for _ in range(10)]
Out[50]: 
[PosixPath('.'),
 PosixPath('.pypirc'),
 PosixPath('.python_history'),
 PosixPath('lshw'),
 PosixPath('.gstreamer-0.10'),
 PosixPath('.gstreamer-0.10/registry.x86_64.bin'),
 PosixPath('.gconf'),
 PosixPath('.gconf/apps'),
 PosixPath('.gconf/apps/gnome-terminal'),
 PosixPath('.gconf/apps/gnome-terminal/%gconf.xml')]

This is essentially the object-oriented version of sjthebat's answer. Note that the bare Path.glob('**') pattern returns only directories!
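To see the difference between the two patterns (behaviour on the Python versions current when this was written; a sketch):

import pathlib

p = pathlib.Path('.')
dirs_only = list(p.glob('**'))       # bare '**': directories only
everything = list(p.glob('**/*'))    # '**/*': files and directories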

gerrit
  • For people dealing with many files in directories, I believe this is the only truly iterative solution on this answer and possibly the only high-level way in the python(3) standard library. It should probably be added as an option to `iterdir()`. – KobeJohn Feb 01 '17 at 04:21
  • @KobeJohn Isn't `yield from alliter(sub)` within a generator `alliter` rather recursive than iterative? – gerrit Jun 29 '17 at 13:05
  • You are right. What I mean is that it gives you results without first doing a full stat on all the files in a directory. So even when you have a large number of files it can generate results immediately. – KobeJohn Jun 30 '17 at 05:06

os.scandir() is a function that "returns directory entries along with file attribute information, giving better performance [than os.listdir()] for many common use cases." It returns an iterator and does not use os.listdir() internally.
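A minimal sketch of a recursive generator built on os.scandir() (Python 3.5+; the name scantree is just illustrative):

import os

def scantree(path):
    # yield file paths lazily, one directory entry at a time
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry.path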

Flair

You can use path.py. Unfortunately the author's website is no longer around, but you can still download the code from PyPI. This library is a wrapper around path functions in the os module.

path.py provides a walkfiles() method which returns a generator iterating recursively over all files in the directory:

>>> from path import path
>>> print path.walkfiles.__doc__
 D.walkfiles() -> iterator over files in D, recursively.

        The optional argument, pattern, limits the results to files
        with names that match the pattern.  For example,
        mydir.walkfiles('*.tmp') yields only files with the .tmp
        extension.

>>> p = path('/tmp')
>>> p.walkfiles()
<generator object walkfiles at 0x8ca75a4>
>>> 
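A usage sketch, iterating the generator with the pattern argument described in the docstring above:

p = path('/tmp')
for f in p.walkfiles('*.tmp'):
    print(f)    # each matching file path, recursively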
Mike Mazur

An addendum to gerrit's answer; I wanted to make something more flexible: list all files under pth matching a given pattern, and also directories if only_file is False.

from pathlib import Path

def walk(pth=Path('.'), pattern='*', only_file=True):
    """List all files in pth matching a given pattern; also list dirs if only_file is False."""
    if pth.match(pattern) and not (only_file and pth.is_dir()):
        yield pth
    for sub in pth.iterdir():
        if sub.is_dir():
            yield from walk(sub, pattern, only_file)
        elif sub.match(pattern):
            yield sub
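For example (a hypothetical call, assuming the function above):

from pathlib import Path

py_files = list(walk(Path('.'), pattern='*.py'))                    # Python files only
everything = list(walk(Path('.'), pattern='*', only_file=False))    # directories too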
yota