String compare in python

Question

I'm looking for duplicate files by comparing the filenames.

However, I found some paths returned by os.walk contain escaped chars. For example, I may get structure in the Earth\'s core.pdf for one file and structure in the Earth\xe2\x80\x99s core.pdf for another.

In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf

In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False

How do I deal with these cases?

To clarify the question in response to the comments: there are also other kinds of duplicate files, such as

one filename containing more spaces than the other
one filename separated by - while the other by :
one filename containing Japanese/Chinese words and the other composed of digits and Japanese/Chinese words ...

they are two different characters... `'` is not equal to `’`. You replace one with the other or compare only the alpha-numerics of a given sentence. — kaza, Oct 06 '17 at 19:43
They aren't the *same*, because they are using different encoding to create the same _general_ visual appearance. c.f. [this](https://stackoverflow.com/questions/32761954/how-to-decode-an-ascii-string-with-backslash-x-x-codes) link for a similar discussion. They are different characters, as @bulbus notes. Fixing that is complicated, as it opens a can of worms about how many possible ways there are to say something that is intellectually similar, but not literally the same. — Paul Hodges, Oct 06 '17 at 19:44
You might try boiling them down to "dictionary" representation, stripping out all the non-alphanumerics before comparing, and writing a report. — Paul Hodges, Oct 06 '17 at 19:47
I know `'` is not `’`. But the two files are the same. For some reason they were not named exactly the same. There are other situations like **one filename containing more spaces than the other**, **one filename separated by `-` while the other by `:`**, **some filenames containing non-letter chars such as Japanese/Chinese words**... These are my difficulties now. — wsdzbm, Oct 06 '17 at 19:58
@bulbus I don't have such a collection including all possible char pairs like this. Comparing letters and digits only may be a workaround. — wsdzbm, Oct 06 '17 at 20:03
@PaulHodges Thx, sounds like a solution, except for some files named mainly by non-English words. what do you mean by "writing a report"? — wsdzbm, Oct 06 '17 at 20:06
Just skip files with questionable matches until a human can evaluate them. For the most thorough and exact comparison, use `cmp` (the program - I guess the equivalent in python might be to read both into vars and compare them?) If they are byte-exact, then the names don't matter. It's the same file. — Paul Hodges, Oct 06 '17 at 20:11
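
The byte-exact check suggested in the comment above can be sketched with the Python standard library. This is only an illustration, not something posted in the thread: the helper names `same_content`, `content_digest` and `group_by_content` are hypothetical, and SHA-256 is an assumed choice of hash. `filecmp.cmp` with `shallow=False` compares two files byte by byte; hashing is the usual way to avoid comparing every pair when there are many files.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def same_content(path_a, path_b):
    """Return True if the two files hold identical bytes."""
    # shallow=False forces a real content comparison instead of
    # trusting the os.stat() signature (size, mtime) alone.
    return filecmp.cmp(path_a, path_b, shallow=False)


def content_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(root):
    """Walk root and return lists of paths whose contents hash the same."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            groups[content_digest(path)].append(path)
    # Only groups with more than one member are duplicate candidates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

With this approach the filenames do not matter at all, so it only finds copies that are identical byte for byte; re-saved or re-encoded versions of the same document would still need a name-based comparison.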

Arthur Gouveia · Accepted Answer · 2017-10-06T20:15:18.777

Maybe you can measure the similarity of the strings instead of requiring an exact match. Getting an exact match can be problematic because of simple things like capitalization.

I suggest the following:

from difflib import SequenceMatcher

s1 = "structure in the Earth\'s core.pdf"
s2 = "structure in the Earth\xe2\x80\x99s core.pdf"

# None as the first argument means no characters are treated as junk
matcher = SequenceMatcher(None, s1, s2)
print(matcher.ratio())
# 0.9411764705882353

This result shows that the similarity between both strings is over 94%. You could define a threshold to delete or to review the items before deletion.
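
The stripping idea from the comments can be combined with the similarity ratio. The following is a minimal sketch under some assumptions rather than a definitive recipe: it targets Python 3, assumes byte-string names are UTF-8 (`\xe2\x80\x99` is the UTF-8 encoding of `’`), and uses hypothetical helpers `normalize_name` and `similarity`. Names are decoded, Unicode-normalized, case-folded, and stripped of everything except letters and digits of any script before being compared.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name):
    """Reduce a filename to a comparison key (hypothetical helper)."""
    if isinstance(name, bytes):
        # Assumption: the filesystem names are UTF-8 encoded.
        name = name.decode("utf-8")
    # NFKC folds compatibility forms (e.g. full-width digits) together;
    # casefold() removes case differences.
    name = unicodedata.normalize("NFKC", name).casefold()
    # Drop whitespace, punctuation and underscores. Letters and digits
    # from any script, including Japanese/Chinese, are kept.
    return re.sub(r"[\W_]+", "", name)


def similarity(name_a, name_b):
    """Similarity ratio of the two normalized names, between 0 and 1."""
    key_a, key_b = normalize_name(name_a), normalize_name(name_b)
    return SequenceMatcher(None, key_a, key_b).ratio()
```

After normalization, names that differ only in quote style, spacing, or `-` versus `:` produce the same key, so they can be grouped with a plain dictionary lookup. Names that differ in real content, such as extra digits, still need the ratio and a threshold chosen by inspecting some real pairs.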
