I'm looking for duplicate files by comparing their filenames.
However, I found that some paths returned by os.walk
appear to contain escaped characters. For example, I may get structure in the Earth\'s core.pdf
for one file and structure in the Earth\xe2\x80\x99s core.pdf
for another.
In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf
In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False
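As a side check, the second name may not be escaped at all: the bytes \xe2\x80\x99 look like the UTF-8 encoding of a single non-ASCII character. Here is a minimal sketch to inspect it, assuming the names are UTF-8 byte strings (the variable names raw, text, and odd are only for illustration):

```python
import unicodedata

# Assumption: the filename is a UTF-8 encoded byte string.
raw = b'structure in the Earth\xe2\x80\x99s core.pdf'
text = raw.decode('utf-8')

# Collect every non-ASCII character together with its Unicode name.
odd = [(ch, unicodedata.name(ch)) for ch in text if ord(ch) > 127]
print(odd)
```

If this reports a typographic quotation mark, then the two names differ in an actual character rather than in escaping.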
How do I deal with these cases?
==== To clarify the question in response to the comments: there are also other situations that I would count as duplicate files, such as
- one filename containing more spaces than the other
- one filename separated by "-" while the other by ":"
- one filename containing Japanese/Chinese words and the other composed of digits and Japanese/Chinese words ...
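
One possible way to approach the cases listed above is to compare a normalized key instead of the raw name. The following is only a sketch under stated assumptions: the helpers name_key and group_by_key and the table PUNCT_MAP are hypothetical names, the names are assumed to be UTF-8 when they arrive as bytes, and "-", ":" and "_" are assumed to be interchangeable separators. It does not attempt the Japanese/Chinese-plus-digits case, which probably needs a fuzzier comparison.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical mapping of typographic punctuation to ASCII; extend as needed.
PUNCT_MAP = {
    u'\u2018': u"'", u'\u2019': u"'",   # curly single quotes
    u'\u201c': u'"', u'\u201d': u'"',   # curly double quotes
    u'\u2013': u'-', u'\u2014': u'-',   # en and em dashes
}


def name_key(filename, encoding='utf-8'):
    """Return a normalized comparison key for a filename (illustrative)."""
    if isinstance(filename, bytes):
        filename = filename.decode(encoding)
    # NFKC folds compatibility forms such as full-width letters and digits.
    text = unicodedata.normalize('NFKC', filename)
    text = u''.join(PUNCT_MAP.get(ch, ch) for ch in text)
    # Assumption: '-', ':' and '_' all act as the same separator.
    text = re.sub(u'[-:_]', u' ', text)
    # Collapse runs of whitespace so extra spaces do not matter.
    text = re.sub(r'\s+', u' ', text, flags=re.UNICODE).strip()
    return text.lower()


def group_by_key(names):
    """Return the groups of names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Feeding the filenames collected from os.walk into group_by_key would then return candidate duplicate groups. Because the key is deliberately lossy, each group is worth confirming by size or checksum before treating its members as true duplicates.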