
I have two sets of paths, with roughly 5000 entries in the first set and 10000 in the second. The first set is contained in the second set. I need to check whether any entry in the second set is a child of an entry in the first set (i.e. whether it is a file or subdirectory inside a directory from the first set). There are some additional requirements:

  • No operations on the file system, it should be done only on the path strings (except for dealing with symlinks if needed).
  • Platform independent (e.g. upper/lower case, different separators)
  • It should be robust with respect to different ways of expressing the same path.
  • It should deal with both symlinks and their targets.
  • Some paths will be absolute and some relative.
  • This should be as fast as possible!
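A minimal sketch of the string-only normalization these requirements point at, not a definitive implementation. The helper name `normalize` and its `flavour` and `base` parameters are hypothetical; `base` stands in for the directory that relative paths are resolved against (normally `os.getcwd()`), and in real code `os.path` would take the place of the explicit `posixpath`/`ntpath` modules. Symlinks are not handled here, since resolving them needs the file system.

```python
import ntpath
import posixpath


def normalize(path, flavour=posixpath, base="/"):
    """Return a canonical string for `path` without touching the file system.

    `flavour` is posixpath or ntpath; `base` is an assumed working directory.
    Note: collapsing '..' textually can be wrong when symlinks are involved.
    """
    joined = flavour.join(base, path)      # an absolute `path` overrides `base`
    collapsed = flavour.normpath(joined)   # drop '.', '..' and duplicate separators
    return flavour.normcase(collapsed)     # fold case and separators on Windows


print(normalize("b//x/./t/../t", base="/work"))                  # /work/b/x/t
print(normalize("B/X", ntpath, "C:\\work"))                      # c:\work\b\x
print(normalize("c:/Work/b/x", ntpath, "C:\\work"))              # c:\work\b\x
```

Different spellings of the same location end up as the same string, so they can be stored in a set or compared directly.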

I'm thinking along the lines of getting both os.path.abspath() and os.path.realpath() for each entry and then comparing them with os.path.commonpath([parent]) == os.path.commonpath([parent, child]). I can't come up with a good way of running this fast though. Or is it safe to just compare the strings directly? That would make it much much easier. Thanks!
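On the question of comparing strings directly, a small sketch may help. It assumes both paths are already absolute and normalized; the helper name `is_child` and the example paths are hypothetical, and `posixpath` is used only so the result does not depend on the platform running it (`os.path` would be the usual choice).

```python
import posixpath


def is_child(parent, child):
    """True if `child` lies strictly below `parent` (both normalized, absolute)."""
    return parent != child and posixpath.commonpath([parent, child]) == parent


parent = "/data/b/x"
print("/data/b/xy".startswith(parent))   # True, although b/xy is only a sibling
print(is_child(parent, "/data/b/xy"))    # False
print(is_child(parent, "/data/b/x/t"))   # True
```

A plain prefix test is therefore only safe if the parent string ends with a separator; comparing whole path components avoids the problem.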

EDIT: I was a bit unclear about the platform independence. It should work on all platforms, but Windows-style and Unix-style paths will never be mixed in the same run.

Rasmus

1 Answer


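You can first resolve every path with os.path.realpath and then use os.path.commonpath to check whether a path from the second set is a child of a path from the first set. (os.path.commonprefix compares character by character, so it would also report b/xy as a child of b/x.)

Example:

import os

first = ['a', 'b/x', '/r/c']
second = ['e', 'b/x/t', 'f']

# Resolve symlinks and make every path absolute and normalized.
first = set(os.path.realpath(p) for p in first)
second = set(os.path.realpath(p) for p in second)

for s in second:
    # commonpath compares whole path components; skip identical paths.
    if any(s != f and os.path.commonpath([s, f]) == f
           for f in first):
        print(s)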

You get:

/full/path/to/b/x/t
Laurent LAPORTE
  • That scales quadratically, so it gets too slow for many files. In my example that would be 5000 * 10000 = 50 million checks. – Rasmus May 17 '18 at 22:16
  • Also, if it were enough to check for common prefix then you could get a sorted list from the set and then iterate over that and check if `i` is a parent of `i+1`. – Rasmus May 17 '18 at 22:21
  • @Rasmus But how long do the 50 million checks take? – Kodiologist May 17 '18 at 23:03
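One way around the quadratic cost raised in the comments is to put the first set in a hash set and, for each path in the second set, look up its ancestors one by one. This is a different technique from the answer's pairwise comparison, sketched here under the assumption that all paths are already normalized, absolute, POSIX-style strings; the function name `find_children` and the sample paths are hypothetical.

```python
import posixpath


def find_children(first, second):
    """Return the paths in `second` that lie strictly below a path in `first`."""
    parents = set(first)
    children = []
    for path in second:
        current = path
        while True:
            ancestor = posixpath.dirname(current)
            if ancestor == current:      # reached the root, nothing matched
                break
            if ancestor in parents:      # constant-time set lookup
                children.append(path)
                break
            current = ancestor
    return children


first = ["/w/a", "/w/b/x", "/r/c"]
second = first + ["/w/e", "/w/b/x/t", "/w/b/xy", "/r/c/d/e"]
print(find_children(first, second))      # ['/w/b/x/t', '/r/c/d/e']
```

Each path costs at most one lookup per directory level, so the work grows with the number of paths times their depth rather than with the product of the two set sizes.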