
I have 3 files and I need to find the duplicate elements that appear in at least two different files.

Younes

1 Answer


The simplest method is to read each file into memory and then compare the results.

For example, given two lists, you can do this to identify the difference (items in the first that are not in the second):

list(set(['foo', 'bar']) - set(['bar']))

So if you had three lists s1, s2, and s3 you could:

s1 = ['a', 'b', 'c', 'd']
s2 = ['b', 'c']
s3 = ['c', 'd']
list(set(s1) - set(s2) - set(s3))
# gives us ['a']

Now we can take that and apply it to reading the files.

This example makes a few assumptions:

- You're comparing lines in the file. If this isn't accurate, you'll need to do your own list/set preparation after reading the files.
- You simply want to identify the unique lines; if you want to do something else with the duplicates, you'll need to modify it accordingly.

with open('s1.txt') as f:
    s1 = f.readlines()  # note: readlines() keeps the trailing '\n' on each line
with open('s2.txt') as f:
    s2 = f.readlines()
with open('s3.txt') as f:
    s3 = f.readlines()

unique_lines = list(set(s1) - set(s2) - set(s3))
print(unique_lines)

Note: this is not particularly performant for large files/datasets, but it is sufficient for most simple cases.

Update: As per comments, to find the duplicates themselves, you can union the intersections between each pair of sets.

>>> s1 = set(['a', 'b', 'c', 'd'])
>>> s2 = set(['x', 'c'])
>>> s3 = set(['z', 'd'])
>>> s1 & s2
{'c'}
>>> s2 & s3
set()
>>> s3 & s1
{'d'}
>>> s1 & s2 | s2 & s3 | s3 & s1
{'d', 'c'}
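
Applied back to the files, a minimal sketch might look like the following (assuming the same s1.txt, s2.txt and s3.txt filenames as above, and stripping trailing newlines so identical lines compare equal):

def read_lines(path):
    with open(path) as f:
        # build a set directly and strip the trailing newline from each line
        return set(line.rstrip('\n') for line in f)

s1 = read_lines('s1.txt')
s2 = read_lines('s2.txt')
s3 = read_lines('s3.txt')

# union of the pairwise intersections = lines present in at least two files
duplicates = (s1 & s2) | (s2 & s3) | (s3 & s1)
print(duplicates)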

Regarding your data size: unless you have specific low-memory constraints, just be aware that this might take a few hundred MB of memory when executing, because all three data sets are held in memory at once.
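
If that becomes a problem, one hedged alternative is to stream the files one at a time and only keep a count of how many distinct files each line appears in (the filenames below are the same assumed s1.txt, s2.txt and s3.txt):

from collections import Counter

file_counts = Counter()
for path in ('s1.txt', 's2.txt', 's3.txt'):
    with open(path) as f:
        # a set per file so a line repeated inside one file is only counted once
        unique_in_file = set(line.rstrip('\n') for line in f)
    file_counts.update(unique_in_file)

# lines that appear in at least two different files
duplicates = [line for line, n in file_counts.items() if n >= 2]
print(duplicates)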

developerjack