I'm using TMDb to look up media based on filename. Most of the time this works fine, except when I use os.listdir() to search for files with Unicode characters in the name. As far as I can tell, TMDb looks for results in Unicode and returns the response in Unicode as well.
Take, for example, a cover art file for Amélie: Amélie.jpg
A simple controlled experiment shows that, no matter what I try, the search works with typed string literals (str or unicode) but not with the names returned by os.listdir().
# -*- coding: utf-8 -*-
import os
import tmdbsimple as tmdb

tmdb.API_KEY = '<your-key-here>'

print os.listdir('/media/artwork/')
print os.listdir(u'/media/artwork/')

print('\nstr controlled test')
s = tmdb.Search()
s.movie(query='Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)

print('\nunicode controlled test')
s = tmdb.Search()
s.movie(query=u'Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)

print('\nstr listdir')
for file in os.listdir('/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

print('\nunicode listdir')
for file in os.listdir(u'/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
Outputs:
['Ame\xcc\x81lie.jpg']
[u'Ame\u0301lie.jpg']
str controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }
unicode controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }
str listdir
('Results', 0)
unicode listdir
('Results', 0)
So why does the typed string literal work consistently, whether str or unicode, while the filename pulled from the filesystem does not?
I've tried:
- encode('utf-8') and decode('utf-8') in a myriad of combinations
- using the u'' prefix in all the file loading
- reloading sys with utf-8 encoding
- I came across a post from Martijn Pieters about Mac OS handling Unicode differently, but I can't seem to find it again
- isinstance(file, str) (surprise, it's not unicode!)
So... how can I get a folder enumeration to work with Unicode characters, or is there a proper way to convert these byte-string filenames to unicode without the dreaded "ordinal out of range" error?
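One detail in the output above may be relevant: the name from os.listdir() spells the accented letter as 'e' followed by U+0301 (a combining acute accent), while the title TMDb returns uses the single code point U+00E9. The sketch below compares the two spellings and converts one into the other. It assumes the mismatch is a matter of Unicode normalization (decomposed versus composed form), which the output hints at but does not prove. The helper name to_nfc is made up for illustration and is not part of os or tmdbsimple.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import os
import unicodedata


def to_nfc(name, encoding='utf-8'):
    """Return `name` as unicode text in composed (NFC) form.

    Hypothetical helper for illustration only. Assumes byte-string
    filenames are UTF-8 encoded, as the listdir output above suggests.
    """
    if isinstance(name, bytes):  # `str` on Python 2, `bytes` on Python 3
        name = name.decode(encoding)
    return unicodedata.normalize('NFC', name)


# The byte string os.listdir('/media/artwork/') returned above:
# 'e' followed by the UTF-8 encoding of U+0301 (combining acute accent).
decomposed = b'Ame\xcc\x81lie.jpg'

# The composed spelling, as seen in the TMDb response: U+00E9.
composed = u'Am\xe9lie.jpg'

# Decoding alone is not enough: the code points still differ.
print(decomposed.decode('utf-8') == composed)  # False

# After NFC normalization the two spellings compare equal.
print(to_nfc(decomposed) == composed)          # True

# The value that would be passed as the search query.
title = os.path.splitext(to_nfc(decomposed))[0]
```

In the listdir loops, this would mean passing query=os.path.splitext(to_nfc(file))[0] instead of the raw name. Whether TMDb then returns the same 8 results as the typed literal is not something this sketch verifies.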