For reasons that are not clear to me, some of the fields that mp4 files use as tag names contain non-ASCII characters, at least the way mutagen sees them. The one that's causing me trouble is '\xa9wrt', which is the tag name for the composer field (!?).

If I run '\xa9wrt'.encode('utf-8') from a Python console I get

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

I'm trying to access this value from a Python file that uses some future-proofing, including:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

I can't even figure out how to enter the string '\xa9wrt' into my code file, since everything in that file is interpreted as utf-8 and the string I'm interested in evidently cannot be written in utf-8. Also, when I get the string '\xa9wrt' into a variable (say, from mutagen), it's hard to work with. For example, "{}".format(the_variable) fails because "{}" is interpreted as u"{}", which once again tries to decode the string as utf-8.

Just naively entering '\xa9wrt' gives me u'\xa9wrt', which is not the same, and none of the other stuff I've tried has worked either:

>>> u'\xa9wrt' == '\xa9wrt'
False
>>> str(u'\xa9wrt')
'\xc2\xa9wrt'
>>> str(u'\xa9wrt') == '\xa9wrt'
False

Note this output is from the console, where it does seem that I can enter non-Unicode literals. I'm using Spyder on Mac OS, with sys.version = 2.7.6 |Anaconda 1.8.0 (x86_64)| (default, Nov 11 2013, 10:49:09)\n[GCC 4.0.1 (Apple Inc. build 5493)].

How can I work with this string in a Unicode world? Is utf-8 incapable of doing so?

Update: Thank you, @tsroten, for the answer. It sharpened my understanding, but I am still unable to achieve the effect I'm looking for. Here's a sharper form of the question: how could I reach the two lines with '??' on them without resorting to the kinds of tricks I'm using?

Note that the str that I'm working with is handed to me by a library. I have to accept it as that type.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

tagname = 'a9777274'.decode('hex') # This value comes from a library as a str, not a unicode
if u'\xa9wrt' == tagname:
    # ??: What test could I run that would get me here without resorting to writing my string in hex?
    print("You found the tag you're looking for!")
else:
    print("Keep looking!")

print(str("This will work: {}").format(tagname))
try:
    print("This will throw an exception: {}".format(tagname))
    # ??: Can I reach this line without resorting to converting my format string to a str?
except UnicodeDecodeError:
    print("Threw exception")

Update 2:

I don't think that any of the strings that you (@tsroten) construct are equal to the one that I'm getting from mutagen. That string still seems to cause problems:

>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> s2 = '\xa9wrt'
>>> s3 = 'a9777274'.decode('hex')
>>> s2 == s
False
>>> s2 == s3
True
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s2)
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte
kuzzooroo

3 Answers

1

\xa9 is the copyright symbol. See C1 Controls and Latin-1 Supplement from the Unicode Standard for more information.

Maybe the tag ©wrt means "Copyright" and not "Composer"?

When you run '\xa9wrt'.encode('utf-8'), you get a UnicodeDecodeError because encode() expects unicode, but you gave it a str. Python 2 therefore first decodes the str to unicode implicitly, using the default encoding ('ascii' unless your environment changed it). That implicit decode is what fails, which is why you see a decode error while encoding. Starting from a unicode string avoids the problem: u'\xa9wrt'.encode('utf-8').
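To make the hidden step visible, here is a minimal sketch that performs the implicit decode by hand. It assumes the default codec is ASCII (the asker's console reports 'utf8', but a lone 0xa9 byte is invalid under either codec), and the variable names are only illustrative.

```python
raw = b'\xa9wrt'  # byte string, as a library might hand it over

# Python 2's str.encode() decodes with the default codec before encoding.
# Doing that decode explicitly shows where the error comes from.
try:
    raw.decode('ascii')
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Starting from a unicode string needs no hidden decode.
encoded = u'\xa9wrt'.encode('utf-8')

print(implicit_decode_failed)  # True
```

The encoded result is the two-byte UTF-8 sequence for U+00A9 followed by "wrt", which is why it no longer compares equal to the original one-byte form.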

In the Python interpreter, by default, type('') should return <type 'str'>. If, in the interpreter, you first type from __future__ import unicode_literals, then type('') should return <type 'unicode'>. You say, "Just naively entering '\xa9wrt' gives me u'\xa9wrt', which is not the same." That statement is sometimes right and sometimes wrong: whether u'\xa9wrt' == '\xa9wrt' evaluates to True or False depends on whether you've imported unicode_literals.

Copy, paste, and save the following to a file (e.g. test.py), then run python test.py from the command-line.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

tag1 = u'\xa9wrt'
tag2 = '\xa9wrt'
print("tag1 = u'\\xa9wrt'")
print("tag2 = '\\xa9wrt'")
print("tag1: %s" % tag1)
print("tag2: %s" % tag2)
print("type(tag1): %s" % type(tag1))
print("type(tag2): %s" % type(tag2))
print("tag1 == tag2: %s" % (tag1 == tag2))
try:
    print("str(tag1): %s" % str(tag1))
except UnicodeEncodeError:
    print("str(tag1): raises UnicodeEncodeError")
print("tag1.encode('utf-8'): ".encode('utf-8') + tag1.encode('utf-8'))

After copying and pasting the above code into a file, then running it in Python 2.7, I got the following output:

tag1 = u'\xa9wrt'
tag2 = '\xa9wrt'
tag1: ©wrt
tag2: ©wrt
type(tag1): <type 'unicode'>
type(tag2): <type 'unicode'>
tag1 == tag2: True
str(tag1): raises UnicodeEncodeError
tag1.encode('utf-8'): ©wrt

EDIT:

Your life will be much easier if your code uses unicode internally. That means you convert input to unicode as soon as you receive it, and convert back to str (if needed) only when you output. So, when you receive a str tagname from somewhere, convert it to unicode first.

For example, here is test.py:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

def match_tag(tagname):
    if isinstance(tagname, str):
        # tagname comes in as str, so let's convert it
        tagname = tagname.decode('utf-8')  # enter the correct encoding here

    # Now that we have a unicode tag, we can deal with it easily:
    if tagname == '\xa9wrt':
        print("We have a match! tagname == %s" % tagname)
        print("Look! We printed tagname and no exception was raised.")

Then, we run it:

>>> from test import match_tag
>>> u = u'\xa9wrt'
>>> s = u.encode('utf-8')
>>> type(u)
<type 'unicode'>
>>> type(s)
<type 'str'>
>>> match_tag(u)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
>>> match_tag(s)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.

So, you need to find out what encoding your input string uses. Then, you'll be able to convert that str to unicode and your code can flow much better.
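As a rough sketch of that "decode at the boundary" idea, the helper below converts byte strings to unicode once and passes unicode through untouched. The name to_text and the 'latin-1' default are assumptions made for illustration; substitute whatever encoding your source actually uses.

```python
def to_text(value, encoding='latin-1'):
    """Return value as unicode, decoding byte strings on the way in.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

tagname = to_text(b'\xa9wrt')  # stand-in for a value returned by a library

# Both operations now involve only unicode, so no implicit decode happens.
matches = (tagname == u'\xa9wrt')
message = u"Found tag: {}".format(tagname)
```

With every incoming name passed through one function like this, the rest of the code can compare and format against ordinary unicode literals.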

EDIT 2:

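If you are simply trying to get s2 = '\xa9wrt' to work, then you need to decode it correctly first. s2 is a byte string (str) whose first byte is 0xa9. That byte is not ASCII, and on its own it is not valid UTF-8 either, so neither the default encoding (check sys.getdefaultencoding() to see which one; probably ascii) nor 'utf-8' can decode it. That's the problem with s2. The 'unicode_escape' codec maps raw bytes like this one to the code point with the same number, so \xa9 becomes U+00A9. Try this when feeding it to match_tag():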

>>> s2 = '\xa9wrt'
>>> s2_decoded = s2.decode('unicode_escape')
>>> type(s2_decoded)  # This is unicode, just like we want.
<type 'unicode'>
>>> match_tag(s2_decoded)
We have a match! tagname == ©wrt
Look! We printed tagname and no exception was raised.
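One caveat, shown as a small sketch: for this tag 'unicode_escape' and 'latin-1' give the same result, but 'unicode_escape' also interprets backslash sequences inside the data, which is rarely what you want for bytes read from a file. The sample byte strings are made up for illustration.

```python
raw = b'\xa9wrt'

# For bytes with no backslashes, the two codecs agree.
same_here = raw.decode('unicode_escape') == raw.decode('latin-1')

# A byte string containing a literal backslash shows the difference.
tricky = b'a\\nb'  # four bytes: a, backslash, n, b
via_escape = tricky.decode('unicode_escape')  # backslash + n becomes a newline
via_latin1 = tricky.decode('latin-1')         # bytes are kept one-to-one

print(same_here)                          # True
print(len(via_escape), len(via_latin1))   # 3 4
```

If the bytes really are Latin-1, decoding with 'latin-1' says so directly and leaves backslashes alone.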
tsroten
  • Thank you for the detailed answer. Unfortunately I'm still coming up short of the finish line. I've updated the question to clarify the issue. – kuzzooroo Mar 10 '14 at 00:38
  • I've edited the answer to address your next question. – tsroten Mar 10 '14 at 02:46
  • Also, when you import ``unicode_literals`` you can stop prepending ``u`` to your strings. – tsroten Mar 10 '14 at 03:04
  • You ask, "Are you reading the tag name from a file?" The answer is that I'm getting the tag name from a library called mutagen that reads ID3 tags and other metadata from music files. Mutagen is drawing that information from the music files themselves, so yes, I believe the tag name ultimately comes from a file. – kuzzooroo Mar 10 '14 at 04:19
  • I've added the answer to your latest update. This should address your question. – tsroten Mar 10 '14 at 05:46
  • Yes, this does it! Is this what's going on? When I pass the str to unicode(), the computer says, "This first character is not in the first 128 ASCII characters, which are the only ones that are standardized. So something fancy must be going on." The string gets passed on to some other part of Python that's in charge of handling fancy stuff. That fancy stuff part says, "I can't understand this. I'll raise an exception." But decoding with `unicode_escape` fixes this by saying, "When I ask for character \xa9, what I really wanted is the \xa9th (169th in base 10) unicode character." – kuzzooroo Mar 10 '14 at 13:45
1

The string is encoded in Latin-1, so if you want to store it in a UTF-8 file or compare it with a UTF-8 string, just do:

>>> '\xa9wrt'.decode('latin-1').encode('utf-8')
'\xc2\xa9wrt'

Or if you want to compare to a Unicode string:

>>> '\xa9wrt'.decode('latin-1') == u'©wrt'
True
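If the library hands back many such names, one option is to decode all of them once. The sketch below uses a plain dict with made-up values as a stand-in for the tag data; it is not mutagen's actual API, and it assumes every key is Latin-1.

```python
# Stand-in for tag data keyed by byte strings; the values are placeholders.
raw_tags = {b'\xa9wrt': u'Some Composer', b'\xa9nam': u'Some Title'}

# Decode every key once so later lookups can use unicode literals.
tags = dict((key.decode('latin-1'), value) for key, value in raw_tags.items())

composer = tags.get(u'\xa9wrt')

# Going the other way: the UTF-8 form of the same name has two bytes for
# the copyright sign, so it never equals the one-byte Latin-1 form.
utf8_key = u'\xa9wrt'.encode('utf-8')
```

In a file that imports unicode_literals, the u prefixes above are redundant and can be dropped.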
lkraider
  • Thank you. How did you determine that the string is encoded in Latin-1? – kuzzooroo Mar 15 '14 at 21:11
  • From the [Quodlibet mailing list](https://groups.google.com/d/msg/quod-libet-development/DvscxyfclyM/2l4URdR9If0J), the [mp4v2 lib](http://code.google.com/p/mp4v2/source/browse/trunk/src/mp4atom.cpp#719) source code, and the [Quicktime file format specification](http://multimedia.cx/mirror/qtff-2007-09-04.pdf) (pages 42-45). – lkraider Mar 17 '14 at 14:50
  • A [latin-1 character table](http://www.idautomation.com/product-support/ascii-chart-char-set.html) also shows that byte 0xa9 is the copyright © character. – lkraider Mar 17 '14 at 14:53
0

I've finally found a way to express the string in question in a utf-8 file with unicode_literals. I convert the string to hex and then back. Specifically, in the console (which is apparently not in unicode_literals mode), I run

"".join(["{0:02x}".format(ord(c)) for c in '\xa9wrt'])

and then in my source file I can create the string I want with

'a9777274'.decode('hex')

But this can't be the right way, can it? For one thing, if my console were running in full unicode I don't know that I could enter the string '\xa9wrt' in the first place to get Python to tell me the hex sequence that represents the byte string.
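For what it's worth, a sketch of two less roundabout options: a b'' literal stays a byte string even when unicode_literals is in effect, and the standard binascii module does the hex round trip without a hand-written loop. The variable names are illustrative.

```python
import binascii

# A bytes literal is unaffected by unicode_literals, so the raw tag name
# can be written directly in a utf-8 source file.
tag_bytes = b'\xa9wrt'

# Hex round trip, equivalent to the manual conversion above.
hex_form = binascii.hexlify(tag_bytes)
round_trip = binascii.unhexlify(hex_form)

print(hex_form == b'a9777274')   # True
print(round_trip == tag_bytes)   # True
```

The source file contains only ASCII characters here (a backslash escape, not the byte itself), so its utf-8 coding declaration is never an obstacle.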

kuzzooroo