14

I'm aware of os.listdir, but as far as I can gather, that gets all the filenames in a directory into memory and then returns the list. What I want is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted while using such a method. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the state of the collection at the beginning and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if there are filesystem changes (adding, removing, or renaming files within the iterated directory) which modify the collection?

There are a few cases that could potentially cause the iterator to fail, and it all depends on how the iterator maintains state. Using S.Lott's example:

filea.txt
fileb.txt
filec.txt

The iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it were to use the filename filea.txt to find its current position in order to find the next file, and filea.txt is not there, what would happen? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt when yielding filea.txt, it could look up the position of fileb.txt, fail, and produce an error.

If the iterator instead were able to maintain an index somehow, e.g. dir.get_file(0), then positional state would not be affected, but some files could be missed, as their indexes could be moved to an index 'behind' the iterator.

This is all theoretical, of course, since there appears to be no built-in (Python) way of lazily iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.

Edit:

The OS of concern is Redhat. My use case is this:

Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

Edit:

Definition of valid:

Adjective 1. Well grounded or justifiable, pertinent.

(Sorry S.Lott, I couldn't resist).

I've edited the paragraph in question above.

Josh Smeaton
  • 47,939
  • 24
  • 129
  • 164
  • I think there is no multiplatform native Python way to do that - which operating system are you on? – jsbueno Feb 23 '11 at 11:56
  • 1
    Is there actually a problem with reading a million filenames into memory? There are very few cases these days where memory usage is actually an issue... – Katriel Feb 23 '11 at 12:53
  • A million 100 char strings are less than 100 MB of RAM ... – Jochen Ritzel Feb 23 '11 at 14:40
  • @Josh Smeaton: A broad term like "valid" is senseless in this context. The definition is not useful, since the term is so broad as to have no meaning. Clearly, it's hilarious to use broad, vague, useless terms with a definition. – S.Lott Feb 23 '11 at 23:37
  • @S.Lott, a failed attempt at humour perhaps. I figured by editing the question as I did, you'd have realised that I agreed with your observation, and attempted to enumerate theorised problems with a potential solution. Maybe I should have phrased the original question as 'are any of these theorised problems actual problems with a real implementation'. – Josh Smeaton Feb 24 '11 at 00:58
  • @katrie @Jochen, there isn't really a problem with reading all the filenames into memory. It's not make or break. But if you have an easy to implement efficient solution, why would you go for a horribly inefficient one? In this particular case, I'm just going to read everything into memory and be done with it, until we merge the processes. – Josh Smeaton Feb 24 '11 at 01:00
  • @Josh: is process A written in Python? Because if so it would be rather easier just to send the filenames between the processes. – Katriel Feb 24 '11 at 11:19
  • @katrie No it's not - it's written in node.js. The current solution is a temporary measure until we can force the two processes into one. – Josh Smeaton Feb 24 '11 at 11:57
  • Anyone getting here in 2015 and later, take a look at https://www.python.org/dev/peps/pep-0471/ - implemented in Python 3.5. TL;DR: from Python 3.5 on, just use `os.scandir` – jsbueno Jul 07 '15 at 04:11

6 Answers

15

tl;dr (update): As of Python 3.5 (currently in beta), just use os.scandir.

As I've written earlier, since iglob only looks like a real iterator (it still builds the whole list of names internally), you will have to call low-level system functions in order to get filenames one at a time as you want. Fortunately, calling low-level functions is doable from Python. The low-level functions differ between Windows and POSIX/Linux systems.

  • If you are on Windows, you should check whether win32api has any call to read "the next entry from a dir", or how to proceed otherwise.
  • If you are on POSIX/Linux, you can call libc functions directly through ctypes and get a file-dir entry (including naming information) one at a time.

The documentation on the C functions is here: http://www.gnu.org/s/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory

http://www.gnu.org/s/libc/manual/html_node/Reading_002fClosing-Directory.html#Reading_002fClosing-Directory

I have provided a snippet of Python code that demonstrates how to call the low-level C functions on my system, but this code snippet may not work on your system [footnote-1]. I recommend opening your /usr/include/dirent.h header file and verifying that the Python snippet is correct (your Python Structure must match the C struct) before using the snippet.

Here is the snippet using ctypes and libc I've put together that allows you to get each filename and perform actions on it. Note that ctypes automatically gives you a Python string when you do str(...) on the char array defined in the structure. (I am using the print statement, which implicitly calls Python's str.)

#!/usr/bin/env python2
from ctypes import *

libc = cdll.LoadLibrary("libc.so.6")

# Declare the return types as pointers so they are not truncated
# to a 32-bit int on 64-bit systems.
libc.opendir.restype = c_void_p
libc.readdir64.restype = c_void_p

class Dirent(Structure):
    # Must mirror struct dirent64 in /usr/include/dirent.h on your system.
    # d_name is oversized here; only the bytes up to the NUL terminator matter.
    _fields_ = [("d_ino", c_uint64),
                ("d_off", c_int64),
                ("d_reclen", c_ushort),
                ("d_type", c_ubyte),
                ("d_name", c_char * 2048)
               ]

dir_ = c_void_p(libc.opendir("/home/jsbueno"))

while True:
    p = libc.readdir64(dir_)
    if not p:
        break
    entry = Dirent.from_address(p)
    print entry.d_name

libc.closedir(dir_)

Update: Python 3.5 is now in beta, and in Python 3.5 the new os.scandir function is available as the materialization of PEP 471 ("a better and faster directory iterator"), which does exactly what is asked for here, besides a lot of other optimizations that can deliver up to a 9-fold speed increase over os.listdir when listing large directories under Windows (a 2-3 fold increase on POSIX systems).
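For illustration, a minimal sketch of that usage (the directory path is just an example):

import os

# os.scandir yields DirEntry objects lazily, one at a time,
# instead of building the whole list of names first.
for entry in os.scandir('/some/large/directory'):
    if entry.is_file():
        print(entry.name)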

[footnote-1] The dirent64 C struct is determined at C compile time for each system.

Trevor Boyd Smith
  • 18,164
  • 32
  • 127
  • 177
jsbueno
  • 99,910
  • 10
  • 151
  • 209
  • I'm going to give the `os.listdir` method a try. If it yields unacceptable memory usage, I'll definitely give this a go. Great answer. – Josh Smeaton Feb 23 '11 at 13:26
  • To iterate over newly written files, the notification method reported in @unutbu's answer would be more appropriate. – jsbueno Feb 23 '11 at 14:48
10

The glob module, from Python 2.5 onwards, has an iglob function which returns an iterator. Iterators are exactly for the purpose of not storing huge values in memory.

glob.iglob(pathname)
Return an iterator which yields the same values as glob() without
actually storing them all simultaneously.

For example:

import glob
for eachfile in glob.iglob('*'):
    # act upon eachfile
    print(eachfile)
Senthil Kumaran
  • 54,681
  • 14
  • 94
  • 131
  • 5
    iglob appears to be a generator wrapper for glob.glob1 which returns a list. So the whole list is still loaded into memory. – Dunes Feb 23 '11 at 12:22
  • Dunes, true, I noticed it. It is calling os.listdir (from posixmodule.c), which is much like calling ls. This is a good first try, and if it fails, an alternative should be looked at. Thanks. – Senthil Kumaran Feb 23 '11 at 12:26
  • Anyway, it seems valid to open a feature request against bugs.python.org requesting that iglob not load all the names up front. – jsbueno Feb 23 '11 at 12:29
  • If iglob is really just a generator around os.listdir, I think I'll just use listdir for my purposes. Good find though. – Josh Smeaton Feb 23 '11 at 13:24
  • I'd suggest to use scandir.walk(): https://www.python.org/dev/peps/pep-0471/ and https://pypi.python.org/pypi/scandir (works for python 2.7) – miguelfg Jan 21 '16 at 10:57
8

Since you are using Linux, you might want to look at pyinotify. It would allow you to write a Python script which monitors a directory for filesystem changes -- such as the creation, modification or deletion of files.

Every time such a filesystem event occurs, you can arrange for the Python script to call a function. This would be roughly like yielding each filename once, while being able to react to modifications and deletions.

It sounds like you already have a million files sitting in a directory. In this case, if you were to move all those files to a new, pyinotify-monitored directory, then the filesystem events generated by the creation of new files would yield the filenames as desired.
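As a rough sketch of this approach (the watched path is an assumption, and IN_CLOSE_WRITE is chosen so the handler only fires once the writer has finished with a file):

import pyinotify

class NewFileHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # Called once per file, after the writing process closes it
        print(event.pathname)
        # ... per-file processing / moving would go here ...

wm = pyinotify.WatchManager()
wm.add_watch('/path/to/monitored/dir', pyinotify.IN_CLOSE_WRITE)
notifier = pyinotify.Notifier(wm, NewFileHandler())
notifier.loop()  # blocks, dispatching one event per filesystem change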

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Good one. I hadn't accounted for the continuous changing of the files, as in 'process A' writing files. Certainly this is the way to go here. – jsbueno Feb 23 '11 at 14:47
  • Funnily enough, I already have a script to do exactly this - we use it to automatically reload our apache wsgi module when the code is changed. Excellent idea. – Josh Smeaton Feb 23 '11 at 22:47
8

@jsbueno's post is really useful, but is still kind of slow on slow disks, since libc readdir() only reads 32K of directory entries at a time. I am not an expert on making system calls directly in Python, but I outlined how to write code in C that will list a directory with millions of files, in a blog post at: http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/.

The ideal case would be to call getdents() directly in Python (http://www.kernel.org/doc/man-pages/online/pages/man2/getdents.2.html) so you can specify a read buffer size when loading directory entries from disk, rather than calling readdir(), which as far as I can tell has a buffer size defined at compile time.
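For what it's worth, here is a rough sketch of that idea using ctypes on x86-64 Linux; the syscall number (217), the struct layout, and the buffer size are platform-specific assumptions that should be checked against asm/unistd_64.h and dirent.h on your system before relying on it:

import ctypes
import os
import struct

libc = ctypes.CDLL("libc.so.6", use_errno=True)
SYS_getdents64 = 217           # x86-64 only; differs on other architectures
BUF_SIZE = 1024 * 1024         # ask the kernel for ~1 MB of entries per call

def iter_dir(path):
    fd = os.open(path, os.O_RDONLY)
    buf = ctypes.create_string_buffer(BUF_SIZE)
    try:
        while True:
            nread = libc.syscall(SYS_getdents64, fd, buf, BUF_SIZE)
            if nread <= 0:
                break
            pos = 0
            while pos < nread:
                # struct linux_dirent64: u64 d_ino, s64 d_off,
                # u16 d_reclen, u8 d_type, then the NUL-terminated name
                d_reclen, = struct.unpack_from("H", buf.raw, pos + 16)
                name = buf.raw[pos + 19:buf.raw.index(b"\0", pos + 19)]
                if name not in (b".", b".."):
                    yield name
                pos += d_reclen
    finally:
        os.close(fd)

for name in iter_dir("/tmp"):
    print(name)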

Ben
  • 2,296
  • 1
  • 15
  • 3
6

What I want, is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.

No method will reveal a filename which "changed". It's not even clear what you mean by "filenames change, new files are added, and files are deleted". What is your use case?

Let's say you have three files: a.a, b.b, c.c.

Your magical "iterator" starts with a.a. You process it.

The magical "iterator" moves to b.b. You're processing it.

Meanwhile a.a is copied to a1.a1, a.a is deleted. What now? What does your magical iterator do with these? It's already passed a.a. Since a1.a1 is before b.b, it will never see it. What's supposed to happen for "filenames change, new files are added, and files are deleted"?

The magical "iterator" moves to c.c. What was supposed to happen to the other files? And how were you supposed to find out about the deletion?


Process A is continuously writing files to a storage location. Process B (the one I'm writing), will be iterating over these files, doing some processing based on the filename, and moving the files to another location.

Don't use the naked file system for coordination.

Use a queue.

Process A writes files and enqueues the add/change/delete memento onto a queue.

Process B reads the memento from queue and then does the follow-on processing on the file named in the memento.
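A minimal sketch of the memento idea, assuming for illustration that both sides live in one Python program and can share a multiprocessing.Queue (in the real setup described above, Process A is a separate program, so the queue would have to be something both processes can reach, e.g. a message broker or a named pipe); the paths and function names are made up:

import os
import shutil
from multiprocessing import Process, Queue

INCOMING = "/tmp/incoming"     # hypothetical staging directory
PROCESSED = "/tmp/processed"   # hypothetical destination directory

def process_a(queue):
    # Writer: create a file, then enqueue a memento describing what happened
    for i in range(3):
        name = os.path.join(INCOMING, "file%d.txt" % i)
        with open(name, "w") as f:
            f.write("payload %d\n" % i)
        queue.put(("added", name))
    queue.put(None)  # sentinel: nothing more to process

def process_b(queue):
    # Reader: act only on what the queue says; never scan the directory
    while True:
        memento = queue.get()
        if memento is None:
            break
        action, name = memento
        if action == "added":
            shutil.move(name, PROCESSED)

if __name__ == "__main__":
    for d in (INCOMING, PROCESSED):
        if not os.path.isdir(d):
            os.makedirs(d)
    q = Queue()
    a = Process(target=process_a, args=(q,))
    b = Process(target=process_b, args=(q,))
    a.start()
    b.start()
    a.join()
    b.join()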

S.Lott
  • 384,516
  • 81
  • 508
  • 779
  • I scrolled down to ""valid"? What does "valid" mean?" and immediately knew that it was you who'd written an answer :P. You raise good points though, I should have fleshed out that bit of my question more, and I would have realized that it didn't make a whole lot of sense in the context of my question. I had in my mind the problem of altering a collection during iteration being 'illegal' in some cases. – Josh Smeaton Feb 23 '11 at 13:19
  • @Josh Smeaton: "I should have fleshed out that bit of my question more". You still can. Please define "valid". Or consider revising the question to remove the undefined terms. – S.Lott Feb 23 '11 at 14:05
1

I think what you are asking is impossible due to the nature of file I/O. Once Python has retrieved the listing of a directory, it cannot maintain a view of the actual directory on disk, nor is there any way for Python to insist that the OS inform it of any modifications to the directory.

All Python can do is ask for periodic listings and diff the results to see whether there have been any changes.
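A minimal sketch of that polling approach (the path and interval are just examples):

import os
import time

watch_dir = "/path/to/watch"   # example path
seen = set()

while True:
    current = set(os.listdir(watch_dir))
    for name in sorted(current - seen):
        print(name)            # a file that appeared since the last poll
    seen = current
    time.sleep(5)              # poll interval; tune for your workload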

The best you can do is create a semaphore file in the directory which lets other processes know that your python process desires that no other process modify the directory. Of course they will only observe the semaphore if you have explicitly programmed them to.

Dunes
  • 37,291
  • 7
  • 81
  • 97