
I have 3 files and I need to find the duplicate elements that appear in at least two different files.

Younes

1 Answer


The simplest method is to read each file into memory and then compare the results.

For example, given two lists, you can do this to identify the difference (items in the first that are not in the second):

list(set(['foo', 'bar']) - set(['bar']))

So if you had three lists s1, s2, and s3 you could:

s1 = ['a', 'b', 'c', 'd']
s2 = ['b', 'c']
s3 = ['c', 'd']
list(set(s1) - set(s2) - set(s3))
# gives us ['a']

Now we can take that and apply it to reading the files.

This example makes a few assumptions:

- You're comparing lines in the file. If this isn't accurate, you'll need to do your own list/set preparation after reading the files.
- You simply want to identify the unique lines; if you want to do something else with the duplicates, you'll need to modify it accordingly.

with open('s1.txt') as f:
    s1 = f.readlines()  # note: readlines() keeps the trailing '\n' on each line
with open('s2.txt') as f:
    s2 = f.readlines()
with open('s3.txt') as f:
    s3 = f.readlines()

unique_lines = list(set(s1) - set(s2) - set(s3))
print(unique_lines)

Note: this is not particularly performant for large files/datasets, but it is sufficient for most simple cases.

Update: As per comments, to find the duplicates themselves, you can union the intersections between each pair of sets.

>>> s1 = set(['a', 'b', 'c', 'd'])
>>> s2 = set(['x', 'c'])
>>> s3 = set(['z', 'd'])
>>> s1 & s2
{'c'}
>>> s2 & s3
set()
>>> s3 & s1
{'d'}
>>> s1 & s2 | s2 & s3 | s3 & s1
{'d', 'c'}
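
Applied back to the files, a minimal sketch might look like the following (assuming the same s1.txt, s2.txt and s3.txt filenames as above, and stripping trailing newlines so identical lines compare equal):

def read_lines(path):
    with open(path) as f:
        # build a set directly and strip the trailing newline from each line
        return set(line.rstrip('\n') for line in f)

s1 = read_lines('s1.txt')
s2 = read_lines('s2.txt')
s3 = read_lines('s3.txt')

# union of the pairwise intersections = lines present in at least two files
duplicates = (s1 & s2) | (s2 & s3) | (s3 & s1)
print(duplicates)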

Regarding your data size: unless you have specific low-memory constraints, just be aware that this might take a few hundred MB of memory when executing, because all three data sets are held in memory at once.
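
If that becomes a problem, one hedged alternative is to stream the files one at a time and only keep a count of how many distinct files each line appears in (the filenames below are the same assumed s1.txt, s2.txt and s3.txt):

from collections import Counter

file_counts = Counter()
for path in ('s1.txt', 's2.txt', 's3.txt'):
    with open(path) as f:
        # a set per file so a line repeated inside one file is only counted once
        unique_in_file = set(line.rstrip('\n') for line in f)
    file_counts.update(unique_in_file)

# lines that appear in at least two different files
duplicates = [line for line, n in file_counts.items() if n >= 2]
print(duplicates)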

developerjack