
I have a collection of files from an older Mac OS file store. I know that there are filename and path name issues with the collection. The issue stems from a code point in the path that I think was rendered as a dash in the original OS. Windows struggles with this code point, and either puts a diacritic on the previous character or replaces it with a ?.

I'm trying to figure out a way to establish a "truth" of the file structure, so I can be sure I'm accounting for every file.

I have explored the files with a few tools, and nothing has matching tallies. I believe the following demonstrates the problem.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import os

    folder = "emails"

    b = os.listdir(folder)

    for f in b:
        print repr(f)
        print os.path.isfile(os.path.join(folder, f))

(I have to redact the actual filenames a little.)

This results in:

'file(1)'
True
'file(2)'
True
'file(3)?'
False
'file(4)'
True

The file name of interest is file(3)?, where the odd code point has been decoded as a ?, and which evaluates as not being a file (or even existing, via os.path.exists). Note that print repr(string) shows that it's handling a properly encoded UTF-8 ?.

I can copy and paste the filename from the folder, and it appears as: file(3) note the full stop.

I can paste the string into my editor (subl) and see that I now have an undisplayable glyph for the final code point:

a = "file(3)"

print a
print repr(a)

Gives me:

 file(3)
'file(3)\xef\x80\xa9'

From this I can see that the odd code point is \xef\x80\xa9. Elsewhere in the set I also find the codepoint \xef\x80\xa8.
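For reference, those three bytes can be decoded to see which code point they stand for. This is a small sketch in Python 3 syntax (the code in this thread is Python 2), and it assumes the bytes are UTF-8, which the repr output suggests but does not prove.

```python
import unicodedata

raw = b"file(3)\xef\x80\xa9"   # bytes as shown by repr() above

text = raw.decode("utf-8")     # assumption: the bytes are UTF-8
last = text[-1]                # the odd trailing character

code_point = "U+%04X" % ord(last)
category = unicodedata.category(last)  # "Co" is the private-use category

print(code_point, category)
```

A private-use category would be consistent with no font having a standard glyph for the character.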

I must assume that os.listdir is not returning raw code point values but an (UTF-8?) encoded string with a code point substitution. That means that when it tests for exists or isfile, it's testing for the existence of the wrong filename, as the file with a substituted ? does not exist.

How do I work with these files safely? I have around 40 in a collection of around 700 files.

Jay Gattuso
  • [this](http://www.fileformat.info/info/unicode/char/f029/index.htm) is that codepoint, if you're curious. It's an odd one. – roippi May 02 '14 at 01:51
  • It is indeed. It's not UTF-8, so it's difficult to know what the real associated glyph should be. I've seen this issue a number of times when wrangling older sets of files created on Mac OS. I suspect it's an old method of character encoding that's not supported anymore. – Jay Gattuso May 02 '14 at 01:53
  • What happens if you pass a unicode to `os.listdir`: `b = os.listdir(u'email')`? – unutbu May 02 '14 at 01:55
  • @unutbu Interesting. `u'file(2)\uf029'` and `True`. This might be the answer. This approach means 2 of the 3 tools tally on the full set now. I can work from here. If you add this as an answer I will close my question. – Jay Gattuso May 02 '14 at 02:02

2 Answers


Try passing a unicode to os.listdir:

folder = u"emails"
b = os.listdir(folder)

Doing so will cause os.listdir to return a list of unicodes instead of strs.
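As a rough sketch of how the tally could then be done, the snippet below lists a folder through a text (Unicode) path and flags names that contain private-use characters such as U+F029. It is written in Python 3 syntax, where a plain string path is already Unicode; under Python 2 the folder would need to be a `u"..."` literal as shown above. The helper names `has_private_use` and `tally` are hypothetical, made up for this illustration.

```python
import os
import unicodedata


def has_private_use(name):
    """Return True if any character in name is in the private-use category."""
    return any(unicodedata.category(ch) == "Co" for ch in name)


def tally(folder):
    """List folder via a text path and describe each entry.

    Returns a sorted list of (name, is_file, has_private_use) tuples.
    """
    rows = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        rows.append((name, os.path.isfile(path), has_private_use(name)))
    return rows
```

Calling `tally(u"emails")` and counting the rows where the last field is true should, if the approach holds for the whole set, single out the 40 or so problem files.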


Unfortunately, the more I think about this, the less I understand why this worked. Every filesystem ultimately stores its filenames in bytes using some encoding. HFS+, for instance, stores filenames in UTF-16. So it would make sense if os.listdir could most easily return those raw bytes without adulteration. But instead, in this case, it looks like os.listdir can return unadulterated unicode, but not unadulterated bytes.

If someone could explain that mystery I would be most appreciative.
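One plausible explanation, offered as an assumption rather than something confirmed in this thread: on Windows, Python 2 answers a byte-string path through the legacy ANSI file API, which converts each name to the active code page and substitutes ? for any character that code page cannot represent. A unicode path goes through the wide-character API instead, so the name comes back intact. The sketch below only imitates that lossy conversion with `str.encode`; `cp1252` is an assumed code page, and the code is in Python 3 syntax.

```python
name = "file(3)\uf029"   # the name as the Unicode listing returns it

# Imitate a conversion to a legacy code page (cp1252 is an assumption).
lossy = name.encode("cp1252", errors="replace")

# UTF-8, by contrast, can represent the character without loss.
lossless = name.encode("utf-8")

print(lossy)
print(lossless)
```

If this is what happens, the ? in the byte-string listing is a real question mark, which would explain why os.path.isfile then looks for a file that does not exist.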

unutbu

Did the files come from Mac Roman encoding (presumably what classic Mac OS used), or the decomposed (NFD-style) normal form of UTF-8 that Mac OS X uses?

The concept of Unicode normal forms is one that every programmer ought to be familiar with... precious few are, though. I can't tell you what you need to know about this with regard to Python, though.
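For what it's worth, Python exposes normal forms through `unicodedata.normalize` in the standard library. The sketch below (Python 3 syntax, with a made-up example string) shows two spellings of the same visible name that compare unequal until both are normalized to the same form.

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point (NFC style)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (NFD style)

same_before = composed == decomposed   # compared code point by code point

nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
same_after = nfc_a == nfc_b

print(same_before, same_after)
```

Normalizing both sides before comparing names from different tools may help tallies agree where accents are involved. It would not change a private-use character such as U+F029, which has no decomposition.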

Cameron Kerr