8

I'm trying to provide a unified interface for retrieving all files from a single directory or a list of directories.

def get_files(dir_or_dirs):
    def helper(indir):
        file_list = glob.glob("*.txt")
        for file in file_list:
            yield file

    if type(dir_or_dirs) is list:
        # a list of source dirs
        for dir in dir_or_dirs:
            yield helper(dir)
    else:
        # a single source dir
        yield helper(dir_or_dirs)

def print_all_files(file_iter):
    for file in file_iter:
        print(file)        # error here!

Questions:

  1. The error says 'file' is still a generator regardless of the input being a single dir or a list of it. Why is it still a generator?
  2. Is it possible to wrap or embed generators in functions? If so, how to make this work?
martineau
  • 119,623
  • 25
  • 170
  • 301
galactica
  • 1,753
  • 2
  • 26
  • 36
  • 2
    Why are you trying to `yield` the return value of `helper`? Helper is a generator function, and it returns a generator iterator. If you want to yield everything the generator yields, that's `yield from`. – user2357112 Jun 21 '17 at 21:35
  • was trying to get a generator for all files under all dirs. Thanks for the 'yield from' heads-up! – galactica Jun 21 '17 at 22:12

1 Answers1

14

You are yielding helper() each time:

yield helper(dir)

but helper() itself is a generator.

In Python 3.3 and newer, use yield from instead:

yield from helper(dir)

This delegates control to another generator. From the Yield expressions documentation:

When yield from <expr> is used, it treats the supplied expression as a subiterator. All values produced by that subiterator are passed directly to the caller of the current generator’s methods.

In older Python versions, including Python 2.x, use another loop:

for file in helper(dir):
    yield file

For more information on what yield from does, see PEP 380 -- Syntax for Delegating to a Subgenerator.

Not that you really need the helper function, it does little more than just loop over the glob.glob() results, you can do that directly.

You also need to correct your function to actually use indir; currently you are ignoring that argument, so you only get text files from the current working directory.

Next, you want to use glob.iglob() instead of glob.glob() to get the lazy evaluation over os.scandir() rather than load all results into memory at once. I'd just turn a non-list dir_or_dirs value into a list, then just use one loop:

import glob
import os.path

def get_files(dirs):
    if not isinstance(dirs, list):
        # make it a list with one element
        dirs = [dirs]

    for dir in dirs:
        pattern = os.path.join(dir, '*.txt')
        yield from glob.iglob(pattern)

Now, instead of a single argument that is either a string or a list, I'd use a variable number of arguments instead, with the *args parameter syntax:

def get_files(*dirs):
    for dir in dirs:
        pattern = os.path.join(dir, '*.txt')
        yield from glob.iglob(pattern)

This can be called with 0 or more directories:

for file in get_files('/path/to/foo', '/path/to/bar'):
    # ...
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • If you just want to know if you can loop over something, `isinstance(dirs, collections.abc.Iterable)` is a better check to use, since that works for any iterable, not just list. A more pythonic way might be to take dirs as a vararg: `def get_files(*dirs)`, then the caller can call with a single argument, varargs, or a list by way of `get_dirs(*iterable_argument)`. – zstewart Jun 21 '17 at 21:52
  • 2
    @zstewart: no, using `collections.abc.Iterable` is *not* a better check, because strings are iterables too. A single directory would then be treated as separate individual characters. – Martijn Pieters Jun 21 '17 at 21:57
  • Thanks! the py2.7 tips worked! I'll try with py3 too. Also thanks for the illustration and further tips for improve the python code quality! – galactica Jun 21 '17 at 22:11
  • @MartijnPieters Oh, right. Duh. I would personally still pick a solution which works for any iterable. `*dirs` might be the best option, since it leave open duck typing of both individual arguments and the iterable argument. Though it does mean that any iterable argument is evaluated eagerly. Another option is to explicitly check for the types of arguments accepted by `os.path.join` (`str`, `bytes`, and in 3.6+ `os.PathLike`) and only convert those to a new single-element list/tuple. Then `get_files` would be able to lazily evaluate a lazy iterator passed to it. – zstewart Jun 22 '17 at 12:26
  • @zstewart: at which point you just put your foot down and have the API only accept a single iterable of directories, and simply avoid ducktyping. – Martijn Pieters Jun 22 '17 at 13:08