
Macs normally use the HFS+ file system, which normalizes paths. For example, if you save a file with an accented é (u'\xe9') in its name and then call os.listdir, you will see that the filename was converted to u'e\u0301'. This is standard Unicode NFD normalization, which the Python unicodedata module can handle. Unfortunately, HFS+ is not fully consistent with NFD, so some paths are not normalized: for example, 福 (u'\ufa1b') is left unchanged, although its NFD form is u'\u798f'.
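The mismatch can be reproduced with the standard library alone. The following is a minimal sketch that only shows what plain NFD does to the two characters mentioned above; it does not touch the file system, and the variable names are made up for illustration.

```python
import unicodedata

e_acute = u'\xe9'      # é as a single precomposed code point
fu_compat = u'\ufa1b'  # 福 as a CJK compatibility ideograph

# Plain NFD decomposes both characters, including the one that
# HFS+ is reported to leave alone.
nfd_e = unicodedata.normalize('NFD', e_acute)
nfd_fu = unicodedata.normalize('NFD', fu_compat)

print(repr(nfd_e))
print(repr(nfd_fu))
```

So unicodedata.normalize agrees with os.listdir for é, but not for 福, which is why it cannot be used as-is to predict the on-disk name.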

So, how to do the normalization in Python? I would be fine using native APIs as long as I can call them from Python.

Deduplicator
Heikki Toivonen
  • A stupid hack that should work: make an empty file in a temp directory and list it. – Danica Aug 08 '13 at 23:12
  • Note that the temp file hack gets very expensive when you consider that you can be passed a path that represents a deep directory structure. You would need to call os.makedirs, touch the file, and then walk the directory structure to see what got created. – Heikki Toivonen Aug 08 '13 at 23:31
  • Presumably the normalization is consistent between directory and file names, so you could split the parts and only make files for ones that have possibly-changing characters to avoid walking directories. But yes, this is obviously not a very good solution. – Danica Aug 08 '13 at 23:53
  • It seems like this is practically a duplicate of http://stackoverflow.com/questions/13089582/os-x-how-to-calculate-normalized-file-name and that seems to have the answer I need: NSString fileSystemRepresentation. Not sure if this should be marked duplicate or deleted or what... – Heikki Toivonen Aug 09 '13 at 02:39
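For reference, the temp file hack discussed in the comments could look roughly like the sketch below. The helper names listed_name and listed_path are hypothetical, and the sketch assumes, as suggested above, that directory and file names are normalized the same way and that pure-ASCII components never change.

```python
import os
import shutil
import tempfile

def listed_name(name):
    """Create an empty file called `name` in a scratch directory and
    return the name the file system reports back for it."""
    scratch = tempfile.mkdtemp()
    try:
        open(os.path.join(scratch, name), 'w').close()
        # The scratch directory holds exactly one entry: our file.
        return os.listdir(scratch)[0]
    finally:
        shutil.rmtree(scratch)

def listed_path(path):
    """Apply listed_name() to each path component separately,
    skipping components that are empty or pure ASCII."""
    parts = []
    for part in path.split(os.sep):
        if part and any(ord(ch) > 127 for ch in part):
            part = listed_name(part)
        parts.append(part)
    return os.sep.join(parts)
```

This avoids os.makedirs and walking a deep tree, but it still creates one file per non-ASCII component, so it remains a workaround rather than a good solution.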

1 Answer


Well, I decided to write out the Python solution, since the related question I pointed to was more about Objective-C.

First you need to install https://pypi.python.org/pypi/pyobjc-core and https://pypi.python.org/pypi/pyobjc-framework-Cocoa. Then the following should work:

import sys

from Foundation import NSString, NSAutoreleasePool

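def fs_normalize(path):
    # Hold an autorelease pool for the duration of the call.
    _pool = NSAutoreleasePool.alloc().init()
    # Ask Foundation for the file system form of the path (returned as str).
    normalized_path = NSString.fileSystemRepresentation(path)
    # Decode the returned str back to unicode.
    upath = unicode(normalized_path, sys.getfilesystemencoding() or 'utf8')
    return upath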

if __name__ == '__main__':
    e = u'\xe9'              # é, precomposed
    j = u'\ufa1b'            # 福, which HFS+ leaves unchanged
    e_expected = u'e\u0301'  # e followed by a combining acute accent

    assert fs_normalize(e) == e_expected
    assert fs_normalize(j) == j

Note that NSString.fileSystemRepresentation() also seems to accept str input. In some cases it returned garbage for str input, so I think it is safer to pass it unicode. It always returns a str, so you need to convert the result back to unicode.
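One way to make sure only unicode reaches the call is to decode byte strings up front. This is a minimal sketch; the helper name to_text and its encoding parameter are hypothetical, and it assumes byte paths are encoded in the file system encoding unless told otherwise.

```python
import sys

def to_text(path, encoding=None):
    """Return `path` as a text string, decoding byte strings first."""
    if isinstance(path, bytes):
        # Fall back to the file system encoding, then to UTF-8.
        codec = encoding or sys.getfilesystemencoding() or 'utf8'
        return path.decode(codec)
    # Already text: pass it through untouched.
    return path
```

It would then be used as fs_normalize(to_text(path)).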

Heikki Toivonen