
I'm reading mojibaked ID3 tags with mutagen. My goal is to fix the mojibake while learning about encodings and Python's handling thereof.

The file I'm working with has an ID3v2 tag, and I'm looking at its album (TALB) frame which, according to the frame's encoding byte, is encoded in Latin-1 (ISO-8859-1). I know, however, that the bytes in this frame are actually encoded in cp1251 (Cyrillic).
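
To make the mismatch concrete, here is a minimal sketch of what I believe happens when a reader trusts the encoding byte (the raw bytes are the ones mutagen shows me below):

>>> raw = '\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'  # cp1251 bytes stored in the frame
>>> raw.decode('latin-1')  # what you get if you believe the encoding byte
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'

Since Latin-1 maps bytes 0x00-0xFF one-to-one onto codepoints U+0000-U+00FF, every Cyrillic byte comes back as an unrelated Latin codepoint, which is the mojibake.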

Here's my code so far:

 >>> from mutagen.mp3 import MP3
 >>> mp3 = MP3(paths[0])
 >>> mp3['TALB']
 TALB(encoding=0, text=[u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'])

Now, as you can see, mp3['TALB'].text[0] is represented here as a Unicode string. However, it's mojibaked:

 >>> print mp3['TALB'].text[0]
 Áóðæóéñêèå ïëÿñêè

I am having very little luck transcoding these cp1251 bytes into their correct Unicode codepoints. My best result so far is rather clumsy:

>>> st = ''.join([chr(ord(x)) for x in mp3['TALB'].text[0]]); st
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Буржуйские пляски <-- **this is the correct, demojibaked text!**

As I understand it, this approach works because I end up transforming the Unicode string into an 8-bit byte string, which I can then decode back into Unicode while specifying the encoding I am decoding from.
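
As a quick sanity check on why that byte-for-byte copy is even possible (a sketch against the same `mp3` object as above):

>>> all(ord(c) < 0x100 for c in mp3['TALB'].text[0])  # every codepoint fits in a single byte
True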

The problem is that I can't decode('cp1251') on the Unicode string directly:

>>> st = mp3['TALB'].text[0]; st
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/dmitry/dev/mp3_tag_encode_convert/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

Can someone explain this? I can't work out why the ASCII codec gets involved at all when I operate directly on the u'' string.

Dmitry Minkovsky

1 Answer

First, encode the Unicode string back into bytes using the encoding it was (incorrectly) decoded from, Latin-1. Because Latin-1 maps every codepoint below U+0100 straight back to the same byte value, this recovers the original bytes unchanged.

>>> tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> raw = tag.encode('latin-1'); raw
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'

Then you can decode those bytes using the encoding they are actually in.

>>> fixed = raw.decode('cp1251'); print fixed
Буржуйские пляски
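
Put together against the frame from the question (a sketch assuming the same `mp3` object; it just chains the two steps above):

>>> print mp3['TALB'].text[0].encode('latin-1').decode('cp1251')
Буржуйские пляски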
Fredrick Brennan
  • Heh! Interesting! Thank you very much! I guess I didn't think of that as practically the same thing I was doing because I don't really understand this encode step you provided. I've got a Unicode string from mutagen (why does mutagen output Unicode when the tag specified 8-bit? I don't know). I'm encoding this Unicode string into a Latin-1 byte string. Since all the bytes are in the 8-bit range, what's the practical difference between the strings when I try `decode('cp1251')` on them? Why does decoding from Unicode fail? – Dmitry Minkovsky Jan 05 '13 at 03:03
  • It fails because `u''.decode(enc)` is the same as `u''.encode(sys.getdefaultencoding()).decode(enc)`, and `sys.getdefaultencoding()` is almost always `== 'ascii'` (see the sketch after these comments). – Francis Avila Jan 05 '13 at 03:14
  • Thanks for the clarification you provided Francis. To clarify my other point of confusion regarding the encode('latin-1') step—I'm not sure **logically** why I'd want to encode into Latin-1 first. Someone used `cp1251`, while the ID3 spec thinks it's Latin-1, so I get mojibake. What I want to do is transcode the `cp1251` into Unicode. If I was using a pen and pad, I'd look up these falsely-Unicode-coded codepoints in `cp1251`, and get the real Unicode codepoints. Perhaps you could shed some light as to why `u''.decode(enc)` has to involve an intermediate `encode(enc)` in the first place? – Dmitry Minkovsky Jan 05 '13 at 03:46
  • Because you cannot decode a `unicode` object: it is already decoded into Unicode codepoints. Unicode objects can **only** be encoded into bytestrings, and bytestrings can **only** be decoded into Unicode objects. So, why can't Python encode the mojibaked string as CP1251? Simple: it's a Unicode object, and in Unicode \xc1 is Á, but CP1251 has no Á. – Fredrick Brennan Jan 05 '13 at 15:46
  • Thank you very much frb, and everyone else. This Q/A and thread has really helped me out. I feel pretty comfortable in my understanding of this now. As I understand, there were `cp1251`-encoded bytestrings in my ID3v2 tags. The tags incorrectly indicated that the bytestrings were `latin-1`. So, Mutagen takes these bytestrings, and does `{str}.decode('latin-1')`, returning a `unicode` object to me. This unicode object is mojibaked. To get the original bytestring I take the `unicode` object and encode it back into Latin-1: `{unicode}.encode('latin-1')`. Then I can do the desired `{str}.decode()` – Dmitry Minkovsky Jan 19 '13 at 04:07
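
A short sketch of the two points made in the comments above (the implicit ASCII step, and why CP1251 cannot encode the mojibaked text); the exact traceback wording may vary between interpreter versions:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xc1\xf3'.encode(sys.getdefaultencoding())  # the hidden step inside u''.decode(...)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> u'\xc1'.encode('cp1251')  # U+00C1 (Á) has no slot in CP1251
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc1' in position 0: character maps to <undefined>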