
I'm looking for duplicate files by comparing their filenames.

However, I found that some paths returned by os.walk contain escaped characters. For example, I may get `structure in the Earth\'s core.pdf` for one file and `structure in the Earth\xe2\x80\x99s core.pdf` for another.

In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf

In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False

How do I deal with these cases?

==== To clarify the question in response to the comments: duplicate files can also differ in other ways, such as

  • one filename containing more spaces than the other
  • one filename separated by - while the other by :
  • one filename containing Japanese/Chinese words and the other composed of digits and Japanese/Chinese words ...
wsdzbm
  • They are two different characters... `'` is not equal to `’`. You could replace one with the other, or compare only the alphanumerics of a given name. – kaza Oct 06 '17 at 19:43
  • They aren't the *same*, because they are using different encoding to create the same _general_ visual appearance. See [this](https://stackoverflow.com/questions/32761954/how-to-decode-an-ascii-string-with-backslash-x-x-codes) link for a similar discussion. They are different characters, as @bulbus notes. Fixing that is complicated, as it opens a can of worms about how many possible ways there are to say something that is intellectually similar, but not literally the same. – Paul Hodges Oct 06 '17 at 19:44
  • You might try boiling them down to "dictionary" representation, stripping out all the non-alphanumerics before comparing, and writing a report. – Paul Hodges Oct 06 '17 at 19:47
  • I know `'` is not `’`. But the two files are the same. For some reason they were not named exactly the same. There are other situations like **one filename containing more spaces than the other**, **one filename separated by `-` while the other by `:`**, **some filenames containing non-Latin characters such as Japanese/Chinese words**... These are my difficulties now. – wsdzbm Oct 06 '17 at 19:58
  • @bulbus I don't have such a collection including all possible char pairs like this. Comparing letters and digits only may be a workaround. – wsdzbm Oct 06 '17 at 20:03
  • @PaulHodges Thanks, that sounds like a solution, except for files named mainly with non-English words. What do you mean by "writing a report"? – wsdzbm Oct 06 '17 at 20:06
  • Just skip files with questionable matches until a human can evaluate them. For the most thorough and exact comparison, use `cmp` (the program - I guess the equivalent in python might be to read both into vars and compare them?) If they are byte-exact, then the names don't matter. It's the same file. – Paul Hodges Oct 06 '17 at 20:11
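
The two suggestions in the comments (compare only letters and digits, then confirm with a byte-exact check) could be sketched roughly as follows. This assumes Python 3, where os.walk yields str names; `name_key` and `same_content` are hypothetical helper names, not part of any library.

```python
import filecmp
import os
import unicodedata


def name_key(filename):
    """Reduce a filename to a comparison key of (letters and digits, extension)."""
    stem, ext = os.path.splitext(filename)
    # NFKC folds compatibility variants (e.g. full-width letters) together;
    # casefold() removes case differences.
    stem = unicodedata.normalize("NFKC", stem).casefold()
    # str.isalnum() is also True for Japanese/Chinese characters, so they are
    # kept; spaces, quotes, '-' and ':' are dropped.
    kept = "".join(ch for ch in stem if ch.isalnum())
    return kept, ext.lower()


def same_content(path_a, path_b):
    """Compare two files by content, ignoring their names entirely."""
    # shallow=False forces a comparison of the file contents rather than
    # relying on os.stat() signatures.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Names with equal keys are only candidates: the key cannot tell a name made of Japanese/Chinese words from the same name with extra digits, so questionable matches would still need `same_content` or a human to decide.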

1 Answer


Maybe you can measure the similarity of the strings instead of requiring an exact match. Exact matching can be problematic because of simple things like capitalization.

I suggest the following:

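from difflib import SequenceMatcher

s1 = "structure in the Earth's core.pdf"
# The UTF-8 bytes of ’ (U+2019), written here as three escaped characters
s2 = "structure in the Earth\xe2\x80\x99s core.pdf"

# ratio() returns a similarity score between 0.0 and 1.0
matcher = SequenceMatcher(None, s1, s2)
print(matcher.ratio())
# 0.9411764705882353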

This result shows that the similarity between the two strings is over 94%. You could define a threshold above which items are deleted, or reviewed before deletion.
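
As a rough sketch of the threshold idea, the function below scores every pair of names and keeps those at or above a cutoff. The name `similar_pairs` and the default of 0.9 are assumptions for illustration; the right cutoff depends on your files.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Return (name_a, name_b, ratio) for every pair scoring >= threshold."""
    pairs = []
    # combinations() visits each unordered pair once, so this is O(n^2)
    # in the number of names.
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs


names = [
    "structure in the Earth's core.pdf",
    "structure in the Earth\u2019s core.pdf",
    "notes.txt",
]
for a, b, ratio in similar_pairs(names):
    print(f"{ratio:.2f}  {a!r}  {b!r}")
```

Because every pair is compared, this gets slow for large directories; grouping by file size or extension first would cut the number of comparisons.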

Arthur Gouveia