3

I'm on an OS X machine running Python 2.7. I'm trying to run os.walk on an SMB share.

import os
import re

for root, dirnames, filenames in os.walk("./test"):
    for filename in filenames:
        print filename
        matchObj = re.match(r".*ö.*", filename, re.UNICODE)

The above code works as long as the filenames do not contain umlauts. In my shell the umlauts are printed fine, but when I copy them into a UTF-8 text editor (in my case Sublime), I get:

[screenshot]

Expected:

filename.jpeg
filename_ö.jpg

Of course the regex fails with that. If I hardcode the filename like this:

re.match( r".*ö.*",'filename_ö',re.UNICODE)

it works fine.

I tried:

os.walk(u"./test")
filename.decode('utf8')

but that gives me:

return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0308' in position 10: ordinal not in range(128)

u'\u0308' is the combining diaeresis, i.e. the dots above the umlaut.

I guess I'm overlooking something stupid?

Tim
  • In Python 2.x, you need to pass a Unicode object to `os.walk()`, otherwise you'll get encoded byte strings: the raw filename bytes on Unix, or the 8-bit charset on Windows (on OS X, the filename is always UTF-8 encoded). – Alastair McCormack Nov 13 '15 at 20:01

2 Answers

7

Unicode characters can be represented in various forms: there's the single precomposed code point "ö", but the same character can also be represented as an "o" followed by a separate combining diacritic. OS X generally prefers the decomposed variant; your editor doesn't seem to handle that very gracefully, and the two separate code points don't match your regex.

You need to normalize your Unicode data if you require one form or the other in particular. See unicodedata.normalize. You want the NFC normalized form.
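To illustrate the difference, here is a minimal sketch. The sample filename and the variable names are made up for the example, and the non-ASCII characters are written as escapes so the snippet does not depend on the source file encoding:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re
import unicodedata

# Hypothetical name in decomposed form: "o" followed by U+0308,
# which is what the question's traceback suggests os.walk returned.
decomposed = "filename_o\u0308.jpg"

# Pattern containing the precomposed o-umlaut, U+00F6.
pattern = re.compile(".*\u00f6.*", re.UNICODE)

# NFC composes "o" + U+0308 into the single code point U+00F6.
composed = unicodedata.normalize("NFC", decomposed)

print(pattern.match(decomposed) is not None)  # False
print(pattern.match(composed) is not None)    # True
print(len(decomposed), len(composed))         # 15 14
```

Both strings render identically; only the code point sequences differ, which is why the regex matches one and not the other.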

deceze
  • Unicode normalization is [not the only issue in the code](http://stackoverflow.com/a/33678596/4279). – jfs Nov 12 '15 at 18:26
3

There are several issues:

  1. As @deceze explained, the output in the screenshot is due to Unicode normalization. Note: the two forms do not necessarily look different, e.g., ö (U+00F6) and ö (U+006F U+0308) look the same in my browser.

  2. r".*ö.*" is a bytestring in Python 2, and its value depends on the encoding declaration at the top of your Python source file (something like # -*- coding: utf-8 -*-). For example, if the declared encoding is utf-8, then the 'ö' bytestring is a sequence of two bytes: '\xc3\xb6'.

    There is no way for the regex engine to know the actual encoding that should be used to interpret input bytestrings.

    You should not use bytestrings to represent text; use Unicode instead (either use u'' literals or add from __future__ import unicode_literals at the top).

  3. filename.decode('utf8') raises UnicodeEncodeError if you use os.walk(u"./test") because filename is already Unicode. To call .decode() on a Unicode object, Python 2 first encodes it implicitly using the default encoding, which is 'ascii', and that step fails on u'\u0308'. Do not decode Unicode: drop .decode('utf8').
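Putting the three points together, the loop from the question could look roughly like the sketch below. The function name find_umlaut_files and the constant UMLAUT_RE are made up for illustration, and the code assumes the caller passes a Unicode path (required on Python 2; the default on Python 3):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import os
import re
import unicodedata

# Unicode pattern with a precomposed o-umlaut (U+00F6), written as an
# escape so it does not depend on the source file encoding.
UMLAUT_RE = re.compile(".*\u00f6.*", re.UNICODE)


def find_umlaut_files(top):
    """Return NFC-normalized file names under `top` that contain U+00F6.

    `top` is assumed to be a Unicode string so that os.walk() yields
    Unicode names rather than bytestrings.
    """
    matches = []
    for root, dirnames, filenames in os.walk(top):
        for filename in filenames:
            # No .decode() here: filename is already Unicode.
            name = unicodedata.normalize("NFC", filename)
            if UMLAUT_RE.match(name):
                matches.append(name)
    return matches
```

With the directory from the question, the call would be find_umlaut_files("./test"), or find_umlaut_files(u"./test") if unicode_literals is not imported on Python 2.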

btw, the last two issues are impossible in Python 3: r".*ö.*" is a Unicode literal there, you can't create a bytestring literal containing literal non-ascii characters, and str has no .decode() method (you would get AttributeError if you tried to decode Unicode). You could run your script on Python 3 to detect Unicode-related bugs.
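A small Python 3 only sketch of both claims; the variable names are made up for the example:

```python
# Python 3 only.
text = "filename_\u00f6"

# str has no .decode() method; only bytes objects do.
has_decode = hasattr(text, "decode")

# A bytes literal containing a literal non-ASCII character is a syntax error.
try:
    compile('b"\u00f6"', "<example>", "eval")
    bytes_literal_ok = True
except SyntaxError:
    bytes_literal_ok = False

print(has_decode, bytes_literal_ok)  # False False
```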

jfs
  • The OP isn't passing a Unicode object to `os.walk()`, although they should be. You might want to update point 3 to recommend passing Unicode and dropping `.decode()`. – Alastair McCormack Nov 13 '15 at 19:58
  • @AlastairMcCormack: read the question. The OP has tried several variants. One of them is `os.walk(u"./test")`, i.e., with Unicode. – jfs Nov 13 '15 at 20:03