I'm using TMDb to look up media based on filename. Most of the time this works fine, except when I use os.listdir() to search for files with Unicode chars in the name. As far as I can tell, TMDb looks for results in unicode and returns the response in unicode as well.

Take for example a cover art file for Amélie:

Amélie.jpg

A simple controlled experiment shows that no matter what I try, it works when using typed Unicode strings, but not when using os.listdir().

# -*- coding: utf-8 -*- 

import os
import tmdbsimple as tmdb

tmdb.API_KEY = '<your-key-here>'

print os.listdir('/media/artwork/')
print os.listdir(u'/media/artwork/')

print('\nstr controlled test')
s = tmdb.Search()
s.movie(query='Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nunicode controlled test')
s = tmdb.Search()
s.movie(query=u'Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nstr listdir')
for file in os.listdir('/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

print('\nunicode listdir')
for file in os.listdir(u'/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

Outputs:

['Ame\xcc\x81lie.jpg']
[u'Ame\u0301lie.jpg']

str controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

unicode controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

str listdir
('Results', 0)

unicode listdir
('Results', 0)

So why does the typed string literal work consistently, whether str or unicode, while the filename pulled from the filesystem does not?

I've tried:

  • encode('utf-8') and decode('utf-8') in myriad combinations
  • using u'' prefix in all the file loading
  • reloading sys with utf-8 encoding
  • I came across a post from Martijn Pieters about Mac OS handling Unicode differently, but I can't seem to find it again
  • isinstance(file, str) (surprise, it's not unicode!)

So... how can I get a folder enumeration to work with unicode chars, OR is there a proper way I can convert these ascii filenames to unicode without the dreaded ordinal out of range error?

brandonscript
  • It would probably be better to simplify your question, since it doesn't seem to have anything to do with TMDb. Simply listing one file with Unicode characters seems to be sufficient. This seems to be a duplicate of https://stackoverflow.com/questions/26732985/utf-8-and-os-listdir. – de1 Jan 05 '18 at 08:57
  • That’s the question I lost from Martijn. I’ll check it out. Fwiw, no, simpler would not be better, because I don’t know for sure yet whether this is related to Unicode or not. – brandonscript Jan 05 '18 at 16:21
  • Possible duplicate of [UTF-8 and os.listdir()](https://stackoverflow.com/questions/26732985/utf-8-and-os-listdir) – brandonscript Jan 05 '18 at 17:28

1 Answer

The difference is that your file system uses decomposed Unicode characters. If you normalize the filenames it returns to composed Unicode characters, the search works. \xe9 is the single Unicode character é, while e\u0301 is an ASCII e followed by a combining acute accent:

>>> import unicodedata as ud
>>> u'Am\xe9lie' == ud.normalize('NFC',u'Ame\u0301lie')
True
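
To make the difference between the two forms visible, here is a minimal sketch that uses only the standard unicodedata module. The variable names are illustrative and not taken from the question's script, and the code is written so that it should run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
import unicodedata as ud

composed = u'Am\xe9lie'        # precomposed: U+00E9 (e with acute accent)
decomposed = u'Ame\u0301lie'   # decomposed: 'e' followed by U+0301 (combining acute)

# Both render the same on screen but are different code point sequences.
print(composed == decomposed)                        # False
print(len(composed))                                 # 6
print(len(decomposed))                               # 7

# NFC composes and NFD decomposes; once both are in the same form, they match.
print(ud.normalize('NFC', decomposed) == composed)   # True
print(ud.normalize('NFD', composed) == decomposed)   # True

# The UTF-8 bytes of the decomposed form are what the str listdir call printed.
print(repr(decomposed.encode('utf-8')))
```

The last line shows the bytes Ame\xcc\x81lie, which is the same sequence as the first line of the question's output.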

So use:

import unicodedata as ud
print('\nunicode listdir')
for filename in os.listdir(u'/media/artwork/'):
    # normalize() takes the form first; NFC composes e + U+0301 into U+00E9
    nfilename = ud.normalize('NFC', filename)
    s = tmdb.Search()
    s.movie(query=os.path.splitext(nfilename)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
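
If the directory is sometimes listed with a byte-string path, the names come back as bytes and need decoding before they can be normalized. The following is only a sketch: nfc_title is a hypothetical helper name, and the UTF-8 default is an assumption about how the file system encodes names.

```python
import os
import unicodedata as ud


def nfc_title(filename, encoding='utf-8'):
    """Return the filename without its extension as NFC-normalized text.

    Hypothetical helper; the name and the UTF-8 default are assumptions.
    """
    # os.listdir() returns byte strings when it is given a byte-string path,
    # so decode those before normalizing. On Python 2, bytes is str.
    if isinstance(filename, bytes):
        filename = filename.decode(encoding)
    stem = os.path.splitext(filename)[0]
    return ud.normalize('NFC', stem)


# Both the unicode and the byte-string form from the question give the same title.
print(nfc_title(u'Ame\u0301lie.jpg') == u'Am\xe9lie')      # True
print(nfc_title(b'Ame\xcc\x81lie.jpg') == u'Am\xe9lie')    # True
```

In the loop above, s.movie(query=nfc_title(filename)) could then stand in for the separate normalize and splitext steps.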
Mark Tolonen