Let's say I have the following:

import itertools
import os

src = itertools.chain(*map(lambda t: map(lambda u: ((t[0], ) + os.path.splitext(u)), t[2]), os.walk(src_folder)))
dst = itertools.chain(*map(lambda t: map(lambda u: ((t[0], ) + os.path.splitext(u)), t[2]), os.walk(dst_folder)))

This creates two iterables of tuples of the form (folder, base name, ext), one for each directory. Note that each is a one-shot iterator, not a list.

I want to be able to find the files that are common in src and dst. I can do this with set(src) & set(dst) as documented. But, what if I want to do it only by the folder and base name, and not by the extension? In other words, what if I want to do set intersection by a custom rule/function? How do I go about doing this?

  • You can create a custom class which contains these fields `(folder, base name, ext)` and then define a custom comparator for this class. You might even be able to turn this into a general utility. Have a look at https://stackoverflow.com/questions/16306844/custom-comparison-functions-for-built-in-types-in-python – lakshayg Jul 11 '18 at 00:16
  • @LakshayGarg A custom `__lt__` isn't going to do any good; the key is a custom `__hash__` (and of course an `__eq__` that matches). Which is doable, but it's probably more work than you'd want to do just for this case. Although maybe with 3.7 and `@dataclass` it isn't. – abarnert Jul 11 '18 at 00:18
  • I agree. I am not familiar with how python sets work. But how about doing something like using the same `__hash__` and `__eq__` as of `(folder, base_name)`. What I'm trying to say is that you won't have a hard time defining those functions if you decide to make this into a reusable utility. But again, I'm no expert ;) – lakshayg Jul 11 '18 at 00:22
  • The same question, answered from an object-oriented perspective, is found here: https://stackoverflow.com/q/37846222/9352077 – Mew Jul 31 '21 at 18:57
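
As a rough sketch of the `__hash__`/`__eq__` idea raised in the comments above, assuming Python 3.7+ and `dataclasses`: marking the extension field with `compare=False` leaves it out of both equality and hashing, so plain set intersection matches on folder and base name only. The class name `FileEntry` and the sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical record; `ext` is excluded from equality and hashing."""
    folder: str
    base: str
    ext: str = field(compare=False)


src_entries = {FileEntry("docs", "readme", ".md"), FileEntry("docs", "notes", ".txt")}
dst_entries = {FileEntry("docs", "readme", ".txt"), FileEntry("img", "logo", ".png")}

common = src_entries & dst_entries
# One entry survives, matching on ("docs", "readme"). Which side's object
# (and therefore which `ext`) ends up in the result is an implementation
# detail, so only the compared fields are inspected here.
common_keys = {(entry.folder, entry.base) for entry in common}
```

The catch with this approach is that entries differing only by extension are treated as the same element, so a set can hold only one of them.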

1 Answer

> In other words, what if I want to do set intersection by a custom rule/function? How do I go about doing this?

You can't. The whole reason set intersection is so fast and simple is that Python can immediately check whether a value is an element of a set, without having to compare it to all of the elements of the set.

But what you can do is transform the sets as you build them, then intersect those:

{path[:2] for path in src} & {path[:2] for path in dst}

The problem is that this gives you only the (folder, base name) keys that are in the intersection, not the full tuples those keys came from. How can you fix that?

The easiest solution is to use a dict instead of a set. You can then use its keys view as a set, and then go back and get the corresponding values:

srcmap = {path[:2]: path for path in src}
srcisect = srcmap.keys() & {path[:2] for path in dst}
result = {srcmap[key] for key in srcisect}

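This may look like a lot more work, but it's really just 4 linear passes instead of 3 (and the extra one is only over the intersection rather than one of the original lists), so at worst the performance will be worse by a small constant factor. One caveat: if several entries in src share the same key (say, the same folder and base name with different extensions), the dict keeps only the last one.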
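
Here is a minimal, self-contained sketch of the dict approach, written as a reusable helper. It rests on a few assumptions that are not in the question: the helper names `walk_entries`, `intersect_by` and `touch` are made up for illustration, folders are stored relative to each root (the `os.walk` paths in the question start with the root itself, so tuples from two different roots would otherwise never match), and every source entry sharing a key is kept in a list rather than only the last one.

```python
import os
import tempfile


def walk_entries(root):
    """Yield (relative folder, base name, ext) for every file under root."""
    for folder, _dirs, files in os.walk(root):
        rel = os.path.relpath(folder, root)
        for name in files:
            base, ext = os.path.splitext(name)
            yield rel, base, ext


def intersect_by(key, left, right):
    """Return the items of `left` whose key also occurs in `right`."""
    left_map = {}
    for item in left:
        # Group by key so entries that differ only by extension all survive.
        left_map.setdefault(key(item), []).append(item)
    shared = left_map.keys() & {key(item) for item in right}
    return {item for k in shared for item in left_map[k]}


def touch(root, *parts):
    """Create an empty file (and any parent folders) for the demo."""
    path = os.path.join(root, *parts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


with tempfile.TemporaryDirectory() as src_folder, \
        tempfile.TemporaryDirectory() as dst_folder:
    touch(src_folder, "a", "report.txt")
    touch(src_folder, "a", "report.md")
    touch(src_folder, "b", "only_src.txt")
    touch(dst_folder, "a", "report.pdf")
    touch(dst_folder, "c", "only_dst.txt")

    result = intersect_by(
        lambda entry: entry[:2],  # compare on (folder, base name) only
        walk_entries(src_folder),
        walk_entries(dst_folder),
    )

print(sorted(result))
```

Passing the key as a function keeps the helper generic: swapping in `lambda entry: entry[1]` would match on base name alone, ignoring the folder as well.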

abarnert