Python unicode os.listdir() not returning results from API

Question

I'm using TMDb to look up media based on filename. Most of the time this works fine, except when I use os.listdir() to find files with Unicode characters in the name. As far as I can tell, TMDb accepts queries in Unicode and returns its responses in Unicode as well.

Take for example a cover art file for Amélie:

Amélie.jpg

A simple controlled experiment shows that no matter what I try, it works when using typed Unicode strings, but not when using os.listdir().

# -*- coding: utf-8 -*- 

import os
import tmdbsimple as tmdb

tmdb.API_KEY = '<your-key-here>'

print os.listdir('/media/artwork/')
print os.listdir(u'/media/artwork/')

print('\nstr controlled test')
s = tmdb.Search()
s.movie(query='Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nunicode controlled test')
s = tmdb.Search()
s.movie(query=u'Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nstr listdir')
for file in os.listdir('/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

print('\nunicode listdir')
for file in os.listdir(u'/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

Outputs:

['Ame\xcc\x81lie.jpg']
[u'Ame\u0301lie.jpg']

str controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

unicode controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

str listdir
('Results', 0)

unicode listdir
('Results', 0)

So why does the typed string literal work consistently, whether str or unicode, while the filename read from the filesystem does not?

I've tried:

encode('utf-8') and decode('utf-8') in myriad combinations
using u'' prefix in all the file loading
reloading sys with utf-8 encoding
I came across a post from Martijn Pieters about Mac OS handling Unicode differently, but I can't seem to find it again
isinstance(file, str) (surprise, it's not unicode!)

So... how can I get a folder enumeration to work with Unicode characters, OR is there a proper way to convert these str filenames to unicode without the dreaded "ordinal not in range" error?

It would probably be better to simplify your question since it doesn't seem to have anything to do with TMDb. Simply listing one file with unicode characters seems to be sufficient. Seems to be a duplicate of https://stackoverflow.com/questions/26732985/utf-8-and-os-listdir ? — de1, Jan 05 '18 at 08:57
That’s the question I lost from Martijn. I’ll check it out. Fwiw, no, simpler would not be better, because I don’t know for sure yet whether this is related to Unicode or not. — brandonscript, Jan 05 '18 at 16:21
Possible duplicate of [UTF-8 and os.listdir()](https://stackoverflow.com/questions/26732985/utf-8-and-os-listdir) — brandonscript, Jan 05 '18 at 17:28

Mark Tolonen · Accepted Answer · 2018-01-05T17:19:54.047

2

The difference is that your file system uses decomposed Unicode characters. If you normalize the returned filenames to composed Unicode characters, the search works. \xe9 is the single Unicode character é, while e\u0301 is an ASCII e followed by a combining acute accent:

>>> import unicodedata as ud
>>> u'Am\xe9lie' == ud.normalize('NFC', u'Ame\u0301lie')
True
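
To see the difference before normalizing, it can help to list the code points in each string. This is a minimal sketch using only the standard library; `describe` is a hypothetical helper name, not part of any library.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata as ud

def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [('U+%04X' % ord(ch), ud.name(ch, 'UNKNOWN')) for ch in text]

composed = u'Am\xe9lie'        # composed form (NFC)
decomposed = u'Ame\u0301lie'   # decomposed form, as os.listdir() returned it in the question

# The two strings render the same but differ in length and content.
print(len(composed), len(decomposed))
for pair in describe(decomposed):
    print(pair)
```

The decomposed string is one character longer because the accent is stored as its own combining code point.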

So use:

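import os
import unicodedata as ud

# tmdb is imported and configured as in the question.
print('\nunicode listdir')
for filename in os.listdir(u'/media/artwork/'):
    # normalize() takes the form first; NFC composes e + U+0301 into \xe9.
    nfilename = ud.normalize('NFC', filename)
    s = tmdb.Search()
    s.movie(query=os.path.splitext(nfilename)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)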
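
If you want to check the normalization step without calling TMDb, it can be pulled into a small function. This is only a sketch: `search_terms` is a hypothetical name, and the hard-coded list stands in for the result of os.listdir(u'/media/artwork/'). On Python 2 the names must be unicode objects, which is why the directory is passed to os.listdir() as a unicode string.

```python
from __future__ import print_function
import os
import unicodedata as ud

def search_terms(filenames, form='NFC'):
    """Strip the extension from each name and normalize it to the given form."""
    terms = []
    for name in filenames:
        stem = os.path.splitext(name)[0]
        terms.append(ud.normalize(form, stem))
    return terms

# Stand-in for os.listdir(u'/media/artwork/') so the sketch runs anywhere.
listing = [u'Ame\u0301lie.jpg']

for term in search_terms(listing):
    # repr() shows the escapes, so the composed \xe9 is visible.
    print(repr(term))
```

Each returned term can then be passed as the query to s.movie().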

edited Jan 05 '18 at 17:19

answered Jan 05 '18 at 17:18

Mark Tolonen


Yep, looks like that's the key. What are the implications when running on Windows or *nix? Will this fail? – brandonscript Jan 05 '18 at 17:19
@brandonscript No, you can normalize an already normalized string, so the above code should work on any OS. – Mark Tolonen Jan 05 '18 at 17:20
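
A quick way to convince yourself of that is to normalize twice and compare. This sketch uses the strings from the question plus an ASCII-only title chosen for illustration:

```python
import unicodedata as ud

composed = u'Am\xe9lie'        # already in NFC
decomposed = u'Ame\u0301lie'   # decomposed form from the question
plain = u'Alien'               # ASCII-only name, chosen for illustration

once = ud.normalize('NFC', decomposed)
twice = ud.normalize('NFC', once)

# Normalizing a second time changes nothing.
assert once == twice == composed
# Strings that are already composed, or pure ASCII, pass through unchanged.
assert ud.normalize('NFC', composed) == composed
assert ud.normalize('NFC', plain) == plain
```

So the normalize call does no harm on a file system that already returns composed names.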
