
Let's say I have three directories dir1, dir2 & dir3, with thousands of files in each. Each file has a unique name with no pattern.

Now, given a filename, I need to find which of the three directories it's in. My first thought was to create a dictionary with the filename as key and the directory as the value, like this:

{'file1':'dir1', 
 'file2':'dir3',
 'file3':'dir1', ... }

But seeing as there are only three unique values, this seems a bit redundant and takes up space.

Is there a better way to implement this? What if I can compromise on space but need faster lookup?
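
For concreteness, the dictionary could be built by listing each directory, something like this (the directory names here are just placeholders for the real paths):

import os

directories = ['dir1', 'dir2', 'dir3']  # placeholder paths

index = {}
for directory in directories:
    for filename in os.listdir(directory):
        # map each filename to the directory that contains it
        index[filename] = directory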

ekhumoro
HMK
  • You're already compromising on space. Lookup performance is as fast as it can get with a dictionary. – Moses Koledoye Nov 01 '17 at 22:40
  • There is no faster lookup. – Elis Byberi Nov 01 '17 at 22:41
  • Have you done any actual testing which proves that performance is going to be a real issue? If there's only three directories, why not just use the file-system to check whether they contain the file? – ekhumoro Nov 01 '17 at 23:02
  • @ekhumoro I just used 3 directories for this example. In practice I have much more. – HMK Nov 01 '17 at 23:19
  • There are of course many ways to do a faster lookup, but whether it's worth going down the path is a function of the number of files you have. For instance, to reduce the collision time, you may want to partition your index based on the prefix of the filename. – axiom Nov 01 '17 at 23:30
  • @MosesKoledoye yes, but the dictionary (or dictionaries) may be created in a way that makes the overall lookup faster. – axiom Nov 01 '17 at 23:30
  • @HMK. No matter how many directories there are, if you haven't tested anything, this is all just premature optimisation. – ekhumoro Nov 02 '17 at 00:27
  • @ekhumoro in this particular case, a simple count of the number of files (not so much the number of directories) should be able to tell if there is any need to optimize. For example, if the avg filename length is 10, and there are 100 million of them, we know there is going to be a problem. – axiom Nov 02 '17 at 01:01
  • @axiom. The number of *files* is totally irrelevant. The only thing that matters is the number of directories. All you need to do is join each directory with the target filename, and then check if `os.path.exists(path)`. Even with several hundred directories, this should take only a fraction of second. – ekhumoro Nov 02 '17 at 18:04
  • @ekhumoro it's absolutely relevant. It's the key, and thus the number of filenames is directly proportional to the time map.get() will take. – axiom Nov 02 '17 at 19:49
  • @axiom. My suggestion is to use **only** the file-system, and not to use a `dict` at all. That will easily be fast enough for most use-cases (unless there are many thousands of directories to check). – ekhumoro Nov 02 '17 at 19:56
  • @ekhumoro Can you link to an example of using the filesystem? Thanks! – HMK Nov 03 '17 at 14:41
  • @HMK. As requested, I posted an answer with an example and explanation. If it isn't helpful, can you explain why, so that I can try to improve it? – ekhumoro Nov 07 '17 at 20:40
  • @ekhumoro Thanks, it was helpful. – HMK Nov 08 '17 at 16:54

2 Answers


A simple way to solve this is to query the file-system directly instead of caching all the filenames in a dict. This will save a lot of space, and will probably be fast enough if there are only a few hundred directories to search.

Here is a simple function that does that:

import os

def find_directory(filename, directories):
    # Check each candidate directory and return the first one containing the file.
    for directory in directories:
        path = os.path.join(directory, filename)
        if os.path.exists(path):
            return directory
    return None  # not found in any of the given directories

On my Linux system, when searching around 170 directories, it takes about 0.3 seconds to do the first search, and then only about 0.002 seconds thereafter. This is because the OS does file-caching to speed up repeated searches. But note that if you used a dict to do this caching in Python, you'd still have to pay a similar initial cost.

Of course, the subsequent dict lookups would be faster than querying the file-system directly. But do you really need that extra speed? To me, two thousandths of a second seems easily "fast enough" for most purposes. And you get the extra benefit of never needing to refresh the file-cache (because the OS does it for you).
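
If you did want a Python-side cache on top of the file-system check, a minimal sketch using functools.lru_cache (the directory tuple below is just an example):

from functools import lru_cache

@lru_cache(maxsize=None)
def find_directory_cached(filename, directories):
    # directories must be hashable (e.g. a tuple) for lru_cache to work
    return find_directory(filename, directories)

# e.g. find_directory_cached('file1', ('dir1', 'dir2', 'dir3'))

Repeated queries for the same filename then skip the file-system entirely, at the cost of the cache never noticing files that later move between directories.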

PS:

I should probably point out that the above timings are worst-case: that is, I dropped all the system file-caches first, and then searched for a filename that was in the last directory.
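
A rough way to reproduce this kind of measurement (the filename and directory list are placeholders):

import time

directories = ['dir%d' % i for i in range(170)]  # placeholder paths

start = time.perf_counter()
find_directory('somefile.txt', directories)
print('first lookup: %.4f seconds' % (time.perf_counter() - start))

start = time.perf_counter()
find_directory('somefile.txt', directories)
print('repeat lookup: %.4f seconds' % (time.perf_counter() - start))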

ekhumoro
  • + the FS cache will get built even if you use a dict, which will be way less memory-efficient, so you will end up having both a cache and an inefficient (memory-wise) cache (aka the dict), and only use one of them. – Adirio Nov 03 '17 at 15:41

You can store the index as a dict of sets. It might be more memory-efficient.

index = {
    "dir1": {"f1", "f2", "f3", "f4"},
    "dir2": {"f3", "f4"},
    "dir3": {"f5", "f6", "f7"},
}

filename = "f4"
for directory, files in index.items():
    if filename in files:
        print(directory)

With only thousands of files, you'll barely see any difference between this method and your inverted index.
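
Such an index could be built straight from the file-system, for example (the directory names are placeholders):

import os

directories = ["dir1", "dir2", "dir3"]  # placeholder paths
index = {d: set(os.listdir(d)) for d in directories}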

Also, repeated strings in Python can be interned in order to save memory. CPython sometimes interns short strings itself.
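
For example, explicit interning with sys.intern guarantees that identical strings share a single object (the string below is an arbitrary example):

import sys

a = sys.intern("some-long-repeated-directory-name")
b = sys.intern("some-long-repeated-directory-name")
assert a is b  # both names point at the same interned string object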

gukoff