
I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 other directories, looking for files larger than, say, 200 GB or so, on multiple Linux servers, and it has to be Python.

I have tried many things like calling du -h from the script, but du is just way too slow to go through a directory as large as 1 TB. I've also tried the find command, as in find ./ -size +200G, but that is also going to take forever.

I have also tried os.walk() with .getsize(), but it's the same problem: too slow. All of these methods take hours and hours, and I need help finding another solution. Not only do I have to do this search for large files on one server, but I will also have to ssh into almost 300 servers and output a giant list of all the files > 200 GB, and the three methods I have tried will not be able to get that done. Any help is appreciated, thank you!

Jean-François Fabre
user7439019
  • Unfortunately, there isn't a way of doing so. For example, a program called Everything indexes your entire drive and lets you sort the files based on a filter. Even this program is unable to find such a "fast way". Python shouldn't be any different. Using a different language wouldn't change anything either, because os.walk and .getsize() are mainly dependent on the operating system and mainly run non-Python code. – WorkingRobot Sep 10 '17 at 19:55
    I don't think python is going to increase your disk read speed. I think what you're looking for is to parallelize so you can check every server at the same time. Then it should only take a few hours total. – Robert Seaman Sep 10 '17 at 19:55
  • @RobertSeaman I'm not really familiar with the concept of parallelizing; are there any links with info that you could send my way? – user7439019 Sep 12 '17 at 00:35

2 Answers

4

It's not true that you cannot do better than os.walk().

scandir is said to be 2 to 20 times faster.

From https://pypi.python.org/pypi/scandir

Python’s built-in os.walk() is significantly slower than it needs to be, because – in addition to calling listdir() on each directory – it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we’re not talking about micro-optimizations.

Since Python 3.5, thanks to PEP 471, scandir is built in, provided in the os package as os.scandir(). A small (untested) example:

import os

max_value = 200 * 1024**3  # 200 GB

for dentry in os.scandir("/path/to/dir"):
    if dentry.is_file() and dentry.stat().st_size > max_value:
        print("{} is big".format(dentry.name))

(Of course you need stat at some point, but with os.walk you were calling stat implicitly anyway. Also, if the big files share a specific extension, you could perform the stat only when the extension matches, saving even more.)
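
For the OP's actual use case (finding every file above a size threshold anywhere in the tree), the one-level loop above would need to recurse into subdirectories. A rough, untested sketch along those lines, with a placeholder root path:

import os

max_value = 200 * 1024**3  # 200 GB threshold

def big_files(root):
    # yield (path, size) for every regular file above max_value
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from big_files(entry.path)
        elif entry.is_file(follow_symlinks=False) and entry.stat().st_size > max_value:
            yield entry.path, entry.stat().st_size

for path, size in big_files("/path/to/dir"):
    print("{}\t{:.1f} GB".format(path, size / 1024**3))

The stat() result is cached on each DirEntry, and as noted above you could skip it entirely for names whose extension already rules them out.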

And there's more to it:

So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.

So migrating to Python 3.5+ magically speeds up os.walk without having to rewrite your code.

In my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local-disk users.

The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (launching it over ssh, for instance). It's less convenient, but it's worth it.
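
For instance, a minimal sketch of that setup, where "fileserver01" and the script path are placeholders and the scan script is assumed to print its findings to stdout:

import subprocess

# launch the scan on the machine that owns the disk and collect its report
result = subprocess.run(
    ["ssh", "fileserver01", "python3", "/opt/scripts/find_big_files.py"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True)
print(result.stdout)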

Jean-François Fabre
  • The 2x to 20x speedup applies to Windows, not to Linux (which the OP tagged). For the OP it is not enough to tell files apart from directories; the code must actually `stat` each file. Besides, the OP was clear that `du` was also too slow, and du already optimizes the directory walk as much as possible. There will be no "magical speedup" from `scandir`; the strategy must be changed. – user4815162342 Sep 11 '17 at 09:14
  • "But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed" – so the speedup applies to Linux as well. But if `du` is already slow, yes, the problem may be elsewhere; I edited my conclusion. – Jean-François Fabre Sep 11 '17 at 09:24
  • As I said, it is not enough for the OP to tell files apart from directories; there is also the need to obtain the size of the files. I suppose there could be a minor speedup in the case where there are many directories and few files in the hierarchy, which would eliminate the need to `stat` the entries known to be directories. But that does not seem likely and, even so, it wouldn't make a difference in the use case presented. – user4815162342 Sep 11 '17 at 10:48
  • I cannot agree more, but as mentioned in my answer, if there's a way to tell the big files apart from the rest by name (e.g. `*.bin`), which is not uncommon, then time can be saved, even compared to `du`. – Jean-François Fabre Sep 11 '17 at 11:45

0

It is hard to imagine that you will find a significantly faster way to traverse a directory than os.walk() and du. Parallelizing the search might help a bit in some setups (e.g. SSD), but it won't make a dramatic difference.

A simple way to make things faster is to run the scan automatically in the background every hour or so and have your actual script just pick up the cached results. This won't help if the results need to be current, but it might work for many monitoring setups.
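
For the 300-server part of the question, here is a rough sketch of the cross-server fan-out suggested in the comments, assuming passwordless ssh and GNU find on the remote side (the hostnames are placeholders, and the remote command could just as well be a Python scan script):

import subprocess
from concurrent.futures import ThreadPoolExecutor

SERVERS = ["server{:03d}".format(i) for i in range(1, 301)]  # placeholder hostnames
REMOTE_CMD = r"find / -xdev -type f -size +200G -printf '%s\t%p\n' 2>/dev/null"

def scan(host):
    # run the scan where the disk is local; only the small report crosses the network
    proc = subprocess.run(["ssh", "-o", "BatchMode=yes", host, REMOTE_CMD],
                          stdout=subprocess.PIPE, universal_newlines=True)
    return host, proc.stdout

with ThreadPoolExecutor(max_workers=30) as pool:
    for host, listing in pool.map(scan, SERVERS):
        for line in listing.splitlines():
            print("{}\t{}".format(host, line))

Each worker spends nearly all of its time waiting on the remote scan, so plain threads are enough; raise max_workers if the controlling machine and the network can tolerate more concurrent sessions.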

user4815162342