For reasons that are not clear to me, some of the fields that mp4 files use as tag names contain non-printable characters, at least the way mutagen sees them. The one that's causing me trouble is '\xa9wrt', which is the tag name for the composer field (!?).
If I run '\xa9wrt'.encode('utf-8') from a Python console I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte
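A minimal sketch of what that error seems to be saying, written with explicit b'' and u'' prefixes so it does not depend on unicode_literals. It assumes the tag name is the four raw bytes a9 77 72 74; the variable names are only for illustration.

```python
# The four raw bytes of the tag name (assumed from the hex value a9777274).
raw = b'\xa9wrt'

# In UTF-8, 0xa9 can only appear as a continuation byte, never as the
# first byte of a character, so decoding these bytes as UTF-8 fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to the code point with the same value, so the
# same bytes decode cleanly to the copyright sign followed by "wrt".
as_text = raw.decode('latin-1')

print(utf8_ok)                 # False
print(as_text == u'\xa9wrt')   # True
```

So the byte string itself is not valid UTF-8, even though the character U+00A9 can be represented in UTF-8.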
I'm trying to access this value from a Python file that uses some future-proofing, including:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
I can't even figure out how to enter the string '\xa9wrt' into my code file, since everything in that file is interpreted as utf-8 and the string I'm interested in evidently cannot be written in utf-8. Also, when I get the string '\xa9wrt' into a variable (say, from mutagen), it's hard to work with. For example, "{}".format(the_variable) fails because "{}" is interpreted as u"{}", which once again tries to decode the string as utf-8.
Just naively entering '\xa9wrt' gives me u'\xa9wrt', which is not the same, and none of the other stuff I've tried has worked either:
>>> u'\xa9wrt' == '\xa9wrt'
False
>>> str(u'\xa9wrt')
'\xc2\xa9wrt'
>>> str(u'\xa9wrt') == '\xa9wrt'
False
Note this output is from the console, where it does seem that I can enter non-Unicode literals. I'm using Spyder on Mac OS, with sys.version = 2.7.6 |Anaconda 1.8.0 (x86_64)| (default, Nov 11 2013, 10:49:09) [GCC 4.0.1 (Apple Inc. build 5493)].
How can I work with this string in a Unicode world? Is utf-8 incapable of doing so?
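As a hedged illustration of the difference between the text and its bytes: a b'' literal stays a byte string even in a file that imports unicode_literals, and UTF-8 can represent the character U+00A9, just not as the single byte 0xa9. The names below are illustrative only.

```python
text_tag = u'\xa9wrt'   # unicode text: U+00A9 followed by "wrt"
byte_tag = b'\xa9wrt'   # raw bytes: the b prefix is unaffected by unicode_literals

utf8_bytes = text_tag.encode('utf-8')      # U+00A9 becomes two bytes: c2 a9
latin1_bytes = text_tag.encode('latin-1')  # U+00A9 becomes one byte: a9

print(len(utf8_bytes))            # 5
print(len(latin1_bytes))          # 4
print(latin1_bytes == byte_tag)   # True
print(utf8_bytes == byte_tag)     # False
```

This matches the console output above, where str(u'\xa9wrt') came back as '\xc2\xa9wrt'.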
Update: Thank you, @tsroten, for the answer. It sharpened my understanding but I am still unable to achieve the effect I'm looking for. Here's a sharper form of the question: how could I reach the two lines with '??' on them without resorting to the kinds of tricks I'm using?
Note that the str that I'm working with is handed to me by a library. I have to accept it as that type.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
tagname = 'a9777274'.decode('hex') # This value comes from a library as a str, not a unicode
if u'\xa9wrt' == tagname:
    # ??: What test could I run that would get me here without resorting to writing my string in hex?
    print("You found the tag you're looking for!")
else:
    print("Keep looking!")
print(str("This will work: {}").format(tagname))
try:
    print("This will throw an exception: {}".format(tagname))
    # ??: Can I reach this line without resorting to converting my format string to a str?
except UnicodeDecodeError:
    print("Threw exception")
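One possible way to reach both '??' lines, offered as a sketch and not a definitive fix: compare bytes with bytes, and decode the library's value once before formatting it. Treating the bytes as Latin-1 is an assumption on my part; it happens to map 0xa9 to the copyright sign and never raises, but the library may intend a different encoding. binascii.unhexlify stands in for the 'hex' codec used above.

```python
import binascii

# Stand-in for the byte string the library hands over.
tagname = binascii.unhexlify(b'a9777274')

# First ??: a b'' literal is a byte string, so this is a bytes-to-bytes
# comparison and no implicit decoding takes place.
found = (tagname == b'\xa9wrt')

# Second ??: decode once at the boundary (Latin-1 is an assumption),
# then keep working with unicode text only.
tag_text = tagname.decode('latin-1')
message = u"This will not throw: {}".format(tag_text)

print(found)                                           # True
print(message == u'This will not throw: \xa9wrt')      # True
```

The design choice here is to convert at the edge of the program, so that format strings and comparisons elsewhere never mix byte strings and unicode.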
Update 2:
I don't think that any of the strings that you (@tsroten) construct are equal to the one that I'm getting from mutagen. That string still seems to cause problems:
>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> s2 = '\xa9wrt'
>>> s3 = 'a9777274'.decode('hex')
>>> s2 == s
False
>>> s2 == s3
True
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s2)
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte
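The session above suggests that s (five UTF-8 bytes) and s2 (four raw bytes) are different byte strings that stand for the same text. A hypothetical variant of match_tag, sketched below, normalizes either form before comparing. The names to_text and match_tag_tolerant are made up for illustration, and the Latin-1 fallback is an assumption, not something taken from mutagen.

```python
WANTED = u'\xa9wrt'  # the composer tag name as unicode text


def to_text(tagname):
    """Hypothetical helper: return the tag name as unicode text."""
    if not isinstance(tagname, bytes):
        return tagname                     # already text, nothing to do
    try:
        return tagname.decode('utf-8')     # handles b'\xc2\xa9wrt'
    except UnicodeDecodeError:
        return tagname.decode('latin-1')   # handles b'\xa9wrt'; never raises


def match_tag_tolerant(tagname):
    """Hypothetical variant of match_tag: normalize first, then compare."""
    return to_text(tagname) == WANTED


s = WANTED.encode('utf-8')   # five bytes, like s in the session above
s2 = b'\xa9wrt'              # four bytes, like s2 and s3

print(match_tag_tolerant(s))        # True
print(match_tag_tolerant(s2))       # True
print(match_tag_tolerant(WANTED))   # True
```

Guessing the encoding this way is a heuristic: some Latin-1 byte sequences are also valid UTF-8, so if the source of the bytes is known, decoding with one fixed encoding is safer.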