
I get an array of paths (combined from default and user settings) and need to perform a recursive search for some data files which can be hidden among tens of thousands of files in any of these paths.

I do the recursive search with a RecursiveDirectoryIterator, but it is quite slow, and the suggested alternative, `exec("find")`, is even slower. To save time, I/O and processing power, I'd like to do some preprocessing beforehand to avoid searching the same directory trees multiple times, i.e. reduce the given paths to the smallest set that still covers all of them. I would appreciate any advice on how to do this.
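
For illustration, the search itself currently looks roughly like this. This is a minimal sketch: the function name is made up, and the filename test is just a stand-in for my real matching logic.

// sketch: recursively collect matching files below $root
function find_data_files( $root ) {
    $found = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator(
            $root,
            // FOLLOW_SYMLINKS relies on the no-circular-symlinks
            // assumption mentioned below
            FilesystemIterator::SKIP_DOTS | FilesystemIterator::FOLLOW_SYMLINKS
        )
    );
    foreach ( $iterator as $file ) {
        // stand-in check; the real matching logic is more involved
        if ( $file->getFilename() === 'data.json' ) {
            $found[] = $file->getPathname();
        }
    }
    return $found;
}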

The catch is that any of the given paths might not only be ancestors of others or symlinked into each other, but might also be given either as realpaths or as paths through a symlink. At least one may assume that there won't be any circular symlinks (although a check wouldn't hurt).
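
For example, suppose `/var/data` is a symlink to `/mnt/disk1/data` (hypothetical paths): both strings name the same directory, and `realpath()` collapses them into one.

// hypothetical: /var/data -> /mnt/disk1/data
echo realpath( '/var/data' ), "\n";       // /mnt/disk1/data
echo realpath( '/mnt/disk1/data' ), "\n"; // /mnt/disk1/data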

I need to implement this in PHP, and I sketched out the following code, which doesn't cover all cases yet.

// make all given paths absolute and resolve symlinks
$search_paths = array_map( function ( $path ) {
    return realpath( $path ) ?: $path;
}, $search_paths );

// remove all duplicate entries
$search_paths = array_unique( $search_paths );

// sort by length of path, shortest first
usort( $search_paths, function ( $a, $b ) {
    return strlen( $a ) - strlen( $b );
} );

// iterate over all paths but the last
$count = count( $search_paths );
for ( $i = 0; $i < $count - 1; $i++ ) {
    if ( !isset( $search_paths[$i] ) ) {
        continue; // already removed as a child of an earlier path
    }
    // require a trailing separator so /e/fg isn't treated as a child of /e/f
    $prefix = rtrim( $search_paths[$i], '/' ) . '/';
    // iterate over all paths following the current one ($i + 1, not $i,
    // or every path would match and remove itself)
    for ( $j = $i + 1; $j < $count; $j++ ) {
        if ( isset( $search_paths[$j] )
            && strpos( $search_paths[$j], $prefix ) === 0 ) {
            // longer path starts with shorter one, thus it's a child. Nuke it!
            unset( $search_paths[$j] );
        }
    }
}
// close the gaps left by unset()
$search_paths = array_values( $search_paths );

Where this code falls short: Imagine these paths in $search_paths

/e/f
/a/b/c/d
/e/f/g/d

with /e/f/g/d being a symlink to /a/b/c/d.

The code above would leave these two:

/e/f
/a/b/c/d

but searching /e/f would actually be sufficient as it covers /a/b/c/d via the symlink /e/f/g/d. This might sound like an edge case but is actually quite likely in my situation.

Tricky, eh?
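
For completeness, here is the rough direction I have in mind, sketched but untested: after the prefix-based pruning, walk each surviving tree once, resolve every symlinked directory it contains, and drop any other search path that those targets already cover. The function name is made up, and it leans on the no-circular-symlinks assumption above.

// sketch: drop search paths that are already reachable through a
// symlink inside another search path
function prune_symlink_covered( array $search_paths ) {
    foreach ( $search_paths as $i => $path ) {
        if ( !isset( $search_paths[$i] ) ) {
            continue; // already removed as covered by an earlier path
        }
        // no FOLLOW_SYMLINKS here, so the walk yields symlink entries
        // without descending into them (and therefore cannot loop)
        $iterator = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator( $path, FilesystemIterator::SKIP_DOTS ),
            RecursiveIteratorIterator::SELF_FIRST
        );
        foreach ( $iterator as $entry ) {
            if ( !$entry->isLink() ) {
                continue;
            }
            $target = realpath( $entry->getPathname() );
            if ( $target === false ) {
                continue; // dangling symlink
            }
            $covered = rtrim( $target, '/' ) . '/';
            foreach ( $search_paths as $key => $other ) {
                if ( $key === $i ) {
                    continue;
                }
                // $other equals the target or lies below it,
                // so scanning $path already reaches it
                if ( strpos( rtrim( $other, '/' ) . '/', $covered ) === 0 ) {
                    unset( $search_paths[$key] );
                }
            }
        }
    }
    return array_values( $search_paths );
}

Since this walks each surviving tree anyway, it would probably make sense to merge it with the actual file search so every directory is only read once.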

I'm pretty sure I'm not the only one with this problem, but I couldn't find a solution using Google. Maybe I just haven't found the right wording for the problem.

Thanks for reading this far! :)

    You may be better off using `exec()` to run `find` or something similar. That would likely be faster, if you're not performing parsing or other operations that would require PHP. – Jeremiah Winsley Dec 17 '14 at 15:26
  • Yes, the parsing will be done after this pre-processing. Unfortunately `exec` is usually turned off on shared hosting plans. – wedi Jan 24 '15 at 21:43
  • Well, if you're stuck on shared hosting, you could look into something like http://php.net/manual/en/class.recursivedirectoryiterator.php, http://stackoverflow.com/questions/624120/is-it-possible-to-speed-up-a-recursive-file-scan-in-php – Jeremiah Winsley Jan 24 '15 at 21:49
  • Thanks for your help! Right now I am using the RecursiveDirectoryIterator, but it is quite slow, although it's already faster than `exec('find')` as your second link states. So with this question I'm looking for a way to preprocess the paths I need to scan to avoid scanning directories twice. – wedi Jan 24 '15 at 21:57
  • Have you checked that it actually is rescanning directories? I'm not very familiar with it, but scanning tens of thousands of files is bound to be slow even if it's not rescanning directories. – Jeremiah Winsley Jan 24 '15 at 22:04
  • True. I should definitely check that although my guess is that it cannot cache the result due to my manual checks on every file... Probably optimising the search is worth its own question. – wedi Jan 24 '15 at 22:08
