I have a collection of files from an older Mac OS file store. I know that there are filename / path name issues with the collection. The issue stems from the inclusion of a codepoint in the path that I think was rendered as a dash in the original OS, but Windows struggles with the codepoint, and either adds a diacritic to the previous character or replaces it with a ?
I'm trying to figure out a way of establishing a "truth" of the file structure, so I can be sure I'm accounting for every file.
I have explored the files with a few tools, and nothing has matching tallies. I believe the following demonstrates the problem.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
folder = "emails"
b = os.listdir(folder)
for f in b:
    print repr(f)
    print os.path.isfile(os.path.join(folder, f))
(I have had to redact the actual filenames a little.)
Results in:-
'file(1)'
True
'file(2)'
True
'file(3)?'
False
'file(4)'
True
The file name of interest is file(3)?, where the odd codepoint has been decoded as a ?, and which evaluates as not being a file (or even as existing, via os.path.exists).
Note that print repr(string) shows that it's handling a properly UTF-8 encoded ?.
I can copy and paste the filename from the folder, and it appears as: file(3) (note the full stop).
I can paste the string into my editor (subl) and see that I now have an undisplayable codepoint glyph for the final codepoint
a = "file(3)"
print a
print repr(a)
Gives me:
file(3)
'file(3)\xef\x80\xa9'
From this I can see that the odd code point is \xef\x80\xa9. Elsewhere in the set I also find the codepoint \xef\x80\xa8.
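To make those byte sequences easier to reason about, here is a minimal sketch that decodes them as UTF-8 and reports the code point and Unicode category of each character. The helper name describe_tail is hypothetical, and the sketch assumes the bytes really are UTF-8, which the repr output above suggests but does not prove.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def describe_tail(raw_bytes):
    """Decode raw_bytes as UTF-8 (an assumption) and describe each character.

    Returns a list of (code point label, Unicode general category) pairs.
    """
    text = raw_bytes.decode("utf-8")
    return [(u"U+%04X" % ord(ch), unicodedata.category(ch)) for ch in text]


# The two trailing sequences seen in the collection.
print(describe_tail(b"\xef\x80\xa9"))
print(describe_tail(b"\xef\x80\xa8"))
```

Both sequences decode to single code points in the Private Use Area (category Co), which has no standard glyph. That would be consistent with different tools displaying or substituting the character differently.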
I must assume that os.listdir is not returning raw codepoint values but an (UTF-8?) encoded string with a codepoint substitution, which means that when it tests for exists or isfile it's testing for the existence of the wrong filename, as the file with a substituted ? does not exist.
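One way to test that assumption is to give os.listdir a unicode path instead of a byte string. On Python 2, a unicode argument makes os.listdir return unicode names, so nothing needs to be substituted with a ?. The sketch below is only that, a sketch: the helper name tally is hypothetical, and decoding a byte-string folder name as UTF-8 is an assumption that happens to hold for a plain ASCII name like "emails".

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import os


def tally(folder):
    """Split the entries of folder into (found, missing) lists of names."""
    # Force a text (unicode) path so that os.listdir returns text names.
    # Decoding as UTF-8 is an assumption; it is safe for ASCII folder names.
    if isinstance(folder, bytes):
        folder = folder.decode("utf-8")
    found, missing = [], []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Names that were listed but cannot be resolved end up in missing.
        (found if os.path.exists(path) else missing).append(name)
    return found, missing
```

Calling found, missing = tally(u"emails") and printing len(found) and repr(missing) would show whether every listed name can now be resolved. If missing is empty, len(found) is a tally that can be compared against the other tools.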
How do I work with these files safely? I have around 40 in a collection of around 700 files.