2

Attempt #2:

People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:

1) Reading a list of files is much faster than walking a directory.

2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.

3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.

Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
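
Here's a minimal sketch of the index half of the idea (the build_index/read_index names and the one-path-per-line format are just placeholders, not a settled design):

import os

INDEX_NAME = ".index"

def build_index(dir_path):
    # Walk the directory once and cache the resulting file list in .index.
    paths = []
    for dirpath, dirnames, filenames in os.walk(dir_path):
        for name in filenames:
            if name == INDEX_NAME:
                continue  # don't index the cache file itself
            paths.append(os.path.join(dirpath, name))
    with open(os.path.join(dir_path, INDEX_NAME), "w") as index_file:
        index_file.write("\n".join(paths))
    return paths

def read_index(dir_path):
    # Cheap path for later calls: read the cached list instead of walking.
    with open(os.path.join(dir_path, INDEX_NAME)) as index_file:
        return [line for line in index_file.read().splitlines() if line]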

Original post:

I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.

Here's my idea for how to go about this:

I have a function called all_files:

def all_files(dir_path, **params):
    ...

The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.

This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. It will take these recent events, find any that apply to the current directory, and update the .index accordingly.

Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
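
To make the sync step concrete, here's roughly what I imagine happening on a cache hit (the mod_log line format and the helper name are assumptions; this isn't hooked up to a real monitor):

import os

MOD_LOG = "mod_log.txt"  # written by the background monitor program

def apply_mod_log(index_paths, index_time, dir_path):
    # Replay monitor-log events that are newer than the index timestamp.
    files = set(index_paths)
    prefix = os.path.abspath(dir_path) + os.sep
    with open(MOD_LOG) as log:
        for line in log:
            # assumed format: absolute_path <TAB> event <TAB> epoch_timestamp
            path, event, stamp = line.rstrip("\n").split("\t")
            if float(stamp) <= index_time or not path.startswith(prefix):
                continue  # older than the index, or outside this directory
            if event == "create":
                files.add(path)
            elif event == "delete":
                files.discard(path)
    return sorted(files)

In practice I'd seek to the tail of the log rather than re-read the whole thing, and I'd have to handle renames and modifications too, but that's the shape of the update step.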

What do you think of my approach? Is there a better way to do this?

Edit:

Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.

import os
from os.path import join, exists
import cProfile, pstats

dir_name = "temp_dir"
index_path = ".index"

def create_test_files():
    os.mkdir(dir_name)
    index_file = open(index_path, 'w')
    for i in range(10):
        print "creating dir: ", i
        sub_dir = join(dir_name, str(i))
        os.mkdir(sub_dir)
        for j in range(100):
            file_path = join(sub_dir, str(j))
            open(file_path, 'w').close()
            index_file.write(file_path + "\n")
    index_file.close()
#

#  0.238 seconds
def test_walk():            
    for info in os.walk("temp_dir"):
        pass

#  0.001 seconds
def test_read():
    open(index_path).readlines()

if not exists("temp_dir"):
    create_test_files()

def profile(s):
    cProfile.run(s, 'profile_results.txt')
    p = pstats.Stats('profile_results.txt')
    p.strip_dirs().sort_stats('cumulative').print_stats(10)

profile("test_walk()")
profile("test_read()")
Jesse Aldridge
  • I don't like the "index the entire filesystem on startup" bit. I think it's pretty obvious why that's a bad idea on today's mega-gigabyte hard drives. – Anon. Jan 13 '10 at 20:19
  • 4
    1) Is this really what you want to do? 2) Just walking the directory doesn't seem like it would be much slower than what you are suggesting as cache. Suggestion: build now, refactor later. – wprl Jan 13 '10 at 20:20
  • 2
    Also, since the file system is constantly changing, how does this help? – S.Lott Jan 13 '10 at 20:21
  • 2
    So you essentially want to recreate the file management system. I don't think much of your approach. I'd stick with documented interfaces and allow the user to cancel out of a long running process. – No Refunds No Returns Jan 13 '10 at 20:21
  • @Anon: I'm not indexing the entire filesystem. I'm only indexing a particular directory. I track *changes* to the filesystem on startup. – Jesse Aldridge Jan 13 '10 at 20:22
  • @SoloBold I did a bit of profiling initially and got a ~10x speed improvement by reading a cached list of files over walking through the dir. Updating the index might cause some slowdown, but I think my method has promise. – Jesse Aldridge Jan 13 '10 at 20:24
  • 1
    Jesse, even if you check the changes at startup, it is still going to get out-of-sync as the filesystem is used throughout the day. – Adam Crossland Jan 13 '10 at 20:24
  • @S.Lott That's a good point... I guess updating the index would probably be the main bottleneck. But if the filesystem didn't change too much between updates to the index, this could be a significant speed up... maybe. – Jesse Aldridge Jan 13 '10 at 20:30
  • 2
    Jesse: 1) implement your solution, 2) fix all the bugs, 3) find and fix the corner cases you missed originally, 4) fix all the new bugs introduced, 5) implement calling os.walk, 6) compare the difference between them, 7) end up using os.walk. I'd just skip steps 1-4 and 6; you'll get better ROI wrt performance in other areas, since you're using Python. You might also be trying to implement an RDBMS and not yet realize it; if so, use sqlite. –  Jan 13 '10 at 20:41
  • @No Refunds I don't know about that. All I know is reading a file is much faster than walking a dir. I added a code sample to my question in an attempt to demonstrate this. – Jesse Aldridge Jan 13 '10 at 20:57
  • @Adam I meant I *launch* the filesystem monitor on startup. It continues to track changes in the background. – Jesse Aldridge Jan 13 '10 at 20:59
  • @Roger Actually, I already did that. I have my function implemented with os.walk. It's stable. But it's slow. It's a bottleneck in my current project. No premature optimization here. I actually realize I might be reinventing some kind of database thing. I have very little experience with this sort of thing, and would like to hear any more specific advice along those lines. – Jesse Aldridge Jan 13 '10 at 21:02
  • 1
    "Didn't change too much between updates?" What can that possibly mean? It changes and your results are *wrong*. Not a little wrong, but *wrong*. – S.Lott Jan 13 '10 at 21:09
  • @S.Lott I track all filesystem changes. I can, in theory, make the results always right. I was saying that more changes to the filesystem between index updates means more lines added to the modification log and hence a longer time needed to update the index. If the filesystem doesn't change too much, then not too many lines will need to be tailed and hence updating the index wouldn't be too costly. – Jesse Aldridge Jan 13 '10 at 21:14
  • 1
    You can't, in theory or in reality, make the results always correct. You have two sets of information -- one is reality and one is a snapshot of reality. The snapshot will always be disjoint from reality. You will not know by how much, and the work that you do based on it will not be accurate. – Adam Crossland Jan 13 '10 at 21:44
  • Can you please come back up a level or three out of the technicalities and tell us things like how many files there are in this directory tree (I presume that's what you meant by "recursively") and how many times per day this scanning needs to happen and what is the actual purpose? Do you have no control over creating/modifying those files? – John Machin Jan 13 '10 at 23:51
  • This is a general function for getting all the files in any given directory. So no assumptions about the number of files. Not sure what you mean by "scanning", but the directory needs to be walked exactly once: the first time all_files is called and the index is built. I'll edit the question to be more clear. – Jesse Aldridge Jan 14 '10 at 02:53
  • Your edit added nothing to most folks' understanding. They know what you said and just don't agree that it's a good idea. And you didn't answer ANY of my questions. "Scanning" means iterating over a bunch of information whether by os.walk or by reading TWO files (index file and monitor log file). – John Machin Jan 14 '10 at 04:16
  • 1
    Funnily enough, there's a cool Unix utility called `locate` which seems to do what the OP wanted in a manner much like what the OP wanted and is widely considered useful... There's the obvious difference of `locate` being upfront about using a snapshot of the fs which will tend to be partially out of date, but you could conceivably have a `locate` -like utility more diligent in its efforts to keep up to date on a particular directory (with per-directory cron jobs or whatever). Anyway, there's no need to be quite so dismissive about the idea... – Michał Marczyk Jan 14 '10 at 07:49
  • 1
    Which is not to say that using two text files to store the file info is necessarily the best idea. Perhaps an small db would be better or maybe it's simply infeasible to write `locate` in Python as opposed to `C` (why would it be, though?). Any opinions on this would be cool to read. – Michał Marczyk Jan 14 '10 at 07:51
  • @Jesse: Sigh; your first 3 sentences added nothing new. Before you sod off, can you answer just one tiny question: The application/transaction/task/whatever is run N times a day, it takes E seconds of elapsed time of which W seconds are occupied by os.walk() -- what are typical values for N, E, and W? – John Machin Jan 14 '10 at 08:57
  • Michał: Yes! I had forgotten about locate. That's very close to what I want. I found a Windows version here: http://locate32.net/index.php Man, it's *really* fast -- instantaneous search of all the filenames on my hard drive. I may be able to use it directly in some cases. Either way, it's a great proof of concept. Thank you, sir. You are a light in the darkness. – Jesse Aldridge Jan 14 '10 at 09:53

6 Answers

7

Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.

Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.

Flaws and potential problems:

You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.

The filesystem monitor program still has to recursively walk the file system, so the work is still being done.

In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.

Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.

I could go on.

If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think that you'd be much better off keeping the index in memory and updating it from within the application itself. That will avoid any file contention issues that would otherwise arise.
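
For what it's worth, a sketch of that in-memory variant might look something like this (the class and method names are purely illustrative):

import os

class FileCache(object):
    # Keep the walked file list in memory; the application records its own changes.
    def __init__(self, root):
        self.root = root
        self.files = None

    def all_files(self):
        if self.files is None:  # first call: pay for the walk exactly once
            self.files = set()
            for dirpath, dirnames, filenames in os.walk(self.root):
                for name in filenames:
                    self.files.add(os.path.join(dirpath, name))
        return sorted(self.files)

    # Call these from wherever the application itself creates or deletes files.
    def note_created(self, path):
        if self.files is not None:
            self.files.add(path)

    def note_deleted(self, path):
        if self.files is not None:
            self.files.discard(path)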

Adam Crossland
  • That's a good point. The FS is already doing this more or less. – wprl Jan 13 '10 at 20:21
  • It means you can't build a universal solution. But if you know something your FS doesn't know (like the files only get updated at 5pm daily), you can use this knowledge to cache the information you need from the FS. – Antony Hatchkins Jan 13 '10 at 20:25
  • Look at the code sample I added above. There is a clear, drastic improvement from reading a list of files over walking a dir. Please elaborate on the flaws you see. – Jesse Aldridge Jan 13 '10 at 21:07
  • The flaw is that your profiling is completely dishonest. (Though, not intentionally.) You compared an os.walk to reading a file. However, the file was created by an equivalent to os.walk. Your performance is going to be (a) os.walk or (b) os.walk to create an index + read the index. When you profiled it, you didn't count all of the work in (b). Your testing setup did the vast majority of the work of (b), and then you only profiled the last tiny step. – Travis Bradshaw Jan 13 '10 at 21:14
  • Jesse, my flaws are going to be listed in my answer to your question. – Adam Crossland Jan 13 '10 at 21:18
  • @Travis Yes, but I only need to create the index file *the first time*. On subsequent calls I can read the index file and avoid the walk. That's what caching is all about. – Jesse Aldridge Jan 13 '10 at 21:24
  • Jesse, caching is only as good as the probability that the information that is cached is still accurate and useful. – Adam Crossland Jan 13 '10 at 21:41
  • @Adam Thanks for the elaboration, but... 1) Seeing as how I'll be updating the index immediately before I return the list of files, the risk of being out of sync seems about the same as using os.walk. 2) No it doesn't. The *indexer* recursively walks the first time. The *monitor* is hooked up to Windows signals. 3) No I don't. Again, Windows signals. I should have mentioned that in my question, sorry. 4) The index is updated by the all_files function just before returning. There will be some slowdown, but I suspect it will still be significantly faster than walking the dir. – Jesse Aldridge Jan 13 '10 at 21:47
3

The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.

Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.

Jesse Aldridge
2

Doesn't Windows Desktop Search provide such an index as a byproduct? On the Mac the Spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.

Of course it's much faster than walking the directory.
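
If you go that route, querying the Spotlight index from Python is just a subprocess call around the exact command above (OS X only; on Windows you'd have to query Windows Desktop Search instead):

import subprocess

def spotlight_files(directory):
    # Ask the existing Spotlight index for every file name under `directory`.
    output = subprocess.check_output(["mdfind", "-onlyin", directory, "-name", "*"])
    return output.decode("utf-8").splitlines()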

tback
  • Thank you for apparently being the only person on StackO to understand that. I hadn't thought of looking at Windows Search. It does indeed have indexing options. But something tells me trying to integrate that indexing with my function would be more trouble than it's worth... – Jesse Aldridge Jan 13 '10 at 21:19
  • The hard part is indeed to keep the index in sync. I'd assume that you are better off if you use the index that is already there. – tback Jan 14 '10 at 08:44
1

The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.

"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)

There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.

To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.

Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You traverse the directory structure while you're creating the sample files and build the index at the same time, and then later you just read that file.

There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially asks, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly that is not an apples-to-apples comparison.

However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.

Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
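
To make that concrete, a fair timing harness has to charge the index build (which still does a full walk) to the cached approach, and only expect the later, read-only calls to win. A rough sketch, reusing the names from your example:

import os
import time

def walk_all(root):
    return [os.path.join(d, f) for d, _, names in os.walk(root) for f in names]

def build_index(root, index_path):
    paths = walk_all(root)  # the cache build still pays for a full walk
    with open(index_path, "w") as fh:
        fh.write("\n".join(paths))

def read_index(index_path):
    with open(index_path) as fh:
        return fh.read().splitlines()

def timed(fn, *args):
    start = time.time()
    fn(*args)
    return time.time() - start

# First call: walk + write the index.  Later calls: just read it.
# Only the later calls can be an order of magnitude cheaper than walking.
print(timed(build_index, "temp_dir", ".index"))
print(timed(read_index, ".index"))
print(timed(walk_all, "temp_dir"))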

(Edit in response to "I don't think you understand what I'm trying to do")

Let's be clear: virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work the way you want it to" to mean that we don't understand.

Let's look at this from another angle. File systems have been an essential component to modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering and many of the most brilliant minds in computer science are working on it constantly.

You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.

If it were as simple to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?

That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.

But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude speedup by reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.

Travis Bradshaw
  • Leveraging the Windows Desktop Search indexing is a nice idea. But I have no idea how to do that. Also, my method really isn't all that complicated or hard to implement. // I think you're misunderstanding what I'm trying to do. I've restated my question in an attempt to be more clear. The thing is I only need to write the index *the first time* I call the function and *subsequent calls* are sped up because I can just read the index and no longer need to walk. I keep the index up to date by applying deltas from the monitor on subsequent calls. – Jesse Aldridge Jan 14 '10 at 03:35
0

I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.

As others have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
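
For example, a plain walk-and-filter along those lines might look like this (the filter parameters are placeholders for whatever yours actually are):

import os
import stat

def all_files(dir_path, min_size=0, extension=None):
    # Recursively yield regular files under dir_path, filtered by size/extension.
    for dirpath, dirnames, filenames in os.walk(dir_path):
        for name in filenames:
            if extension and not name.endswith(extension):
                continue
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if stat.S_ISREG(st.st_mode) and st.st_size >= min_size:
                yield path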

jathanism
  • Yes, I'm already using walk and stat. But my function is slow and I think this could make it significantly faster. – Jesse Aldridge Jan 13 '10 at 21:09
  • Ah, ok then. You might want to consider one of the awesome search apps out there that operate in a Django-esque ORM style. There are a few listed here, the most popular of which seems to be Whoosh: http://haystacksearch.org/docs/installing_search_engines.html – jathanism Jan 13 '10 at 22:16
  • I've actually used Whoosh and SOLR. I think they are more suited to full text search than retrieving all files and filtering on attributes. I don't think something like that would work well for this case. – Jesse Aldridge Jan 13 '10 at 22:41
  • Ahh, that's a bummer. Well, sorry I couldn't help, I was thinking that indexing features would be useful. – jathanism Jan 13 '10 at 23:06
0

I'm new to Python, but a combination of list comprehensions, an iterator, and a generator should scream, according to reports I've read.

import os
import re

class DirectoryIterator:
    def __init__(self, start_dir, pattern):
        self.directory = start_dir
        self.pattern = pattern

    def __iter__(self):
        for dirpath, dirnames, filenames in os.walk(self.directory):
            # filter the names with a list comprehension, then yield the matches
            for path in [os.path.join(dirpath, name) for name in filenames
                         if re.search(self.pattern, name)]:
                yield path

###########

for file_name in DirectoryIterator(".", r"\.py$"): print file_name
null