How to decode cp1252 string?

Question

I am reading an MP3 tag (ID3v1) with eyeD3 and would like to understand its encoding. Here is what I tried:

>>> print(type(mp3artist_v1))
<type 'unicode'>

>>> print(type(mp3artist_v1.encode('utf-8')))
<type 'str'>

>>> print(mp3artist_v1)
Zåìôèðà

>>> print(mp3artist_v1.encode('utf-8').decode('cp1252'))
ZÃ¥Ã¬Ã´Ã¨Ã°Ã 

>>> print(u'Zемфира'.encode('utf-8').decode('cp1252'))
ZÐµÐ¼Ñ„Ð¸Ñ€Ð°

An online decoding tool reports that ZÐµÐ¼Ñ„Ð¸Ñ€Ð° can be converted to the correct value Zемфира by changing the encoding CP1252 → UTF-8, and that Zåìôèðà can be converted by changing the encoding CP1252 → CP1251.

What should I do to get Zемфира from mp3artist_v1? .encode('cp1252').decode('cp1251') works well, but how can I detect the encoding automatically (only three encodings are possible: cp1251, cp1252, utf-8)? I was planning to use the following code:

def forceDecode(string, codecs=['utf-8', 'cp1251', 'cp1252']):
    for i in codecs:
        try:
            print(i)
            return string.decode(i)
        except:
            pass
    print "cannot decode url %s" % ([string])

but it does not help, since I have to encode with one charset first and then decode with another.

gog · Accepted Answer · 2014-04-27T19:41:16.560

This works:

s = u'Zåìôèðà'
print s.encode('latin1').decode('cp1251')
# Zемфира

Explanation: Zåìôèðà is mistakenly treated as a Unicode string, while it is actually a sequence of bytes that spell Zемфира in cp1251. Applying encode('latin1') converts this "Unicode" string back to bytes, using the code point numbers as byte values; decode('cp1251') then converts those bytes to Unicode using the correct encoding.
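To make the round trip concrete, here is a minimal sketch of how the mangled value arises and how it is undone. It assumes Python 3 (where str is Unicode text and bytes is raw data), unlike the Python 2 snippets in this thread; the variable names are only illustrative.

```python
# Python 3 sketch: str is Unicode text, bytes is raw data.
original = 'Zемфира'

# What the tag stores: the artist name as cp1251 bytes.
raw = original.encode('cp1251')

# What a reader assuming Latin-1 produces: every byte becomes the
# code point with the same number.
mangled = raw.decode('latin1')

# The repair: recover the raw bytes, then decode with the right codec.
repaired = mangled.encode('latin1').decode('cp1251')

print(mangled)
print(repaired)
```

For this particular string, encode('cp1252') yields the same bytes as encode('latin1'), because all of its characters lie in the range where the two codecs agree (0xA0 to 0xFF). latin1 is the safer choice in general, since it maps every code point from U+0000 to U+00FF to a byte, while cp1252 leaves a few byte values undefined.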

As to automatic decoding, the following brute force approach seems to work with your examples:

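# -*- coding: utf-8 -*-
import re, itertools

def guess_decode(s):
    encodings = ['cp1251', 'cp1252', 'utf8']

    # try chains of 2, 4, 6 and 8 steps; each pair of steps is one encode + decode
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # alternate: unicode -> bytes (encode), bytes -> unicode (decode)
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            # accept the first result made only of ASCII word characters,
            # whitespace and Cyrillic letters
            if re.match(ur'^[\w\sа-яА-Я]+$', r):
                print 'debug', encs, r
                return r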

print guess_decode(u'ZÐµÐ¼Ñ„Ð¸Ñ€Ð°')
print guess_decode(u'Zåìôèðà')
print guess_decode(u'ZÃ¥Ã¬Ã´Ã¨Ã°Ã\xA0')

Results:

debug ('cp1252', 'utf8') Zемфира
Zемфира
debug ('cp1252', 'cp1251') Zемфира
Zемфира
debug ('cp1252', 'utf8', 'cp1252', 'cp1251') Zемфира
Zемфира
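For readers on Python 3, the same brute-force search could be sketched as follows. This is a port under assumptions, not the original code: the PLAUSIBLE pattern name, the encodings and max_steps parameters, and the explicit None return are additions for illustration. The re.ASCII flag is needed because in Python 3 \w would otherwise match accented Latin letters and accept the mangled input unchanged.

```python
import itertools
import re

# Assumption: a plausible result contains only ASCII word characters,
# ASCII whitespace and Cyrillic letters in the range а-я / А-Я.
# re.ASCII keeps \w and \s from matching characters such as 'å' or NBSP.
PLAUSIBLE = re.compile(r'^[\w\sа-яА-Я]+$', re.ASCII)


def guess_decode(text, encodings=('cp1251', 'cp1252', 'utf8'), max_steps=8):
    """Return the first plausible repair of text, or None if none is found."""
    # Try chains of 2, 4, ... steps; each pair is one encode plus one decode.
    for steps in range(2, max_steps + 1, 2):
        for encs in itertools.product(encodings, repeat=steps):
            candidate = text
            try:
                for enc in encs:
                    # Alternate: str -> bytes (encode), bytes -> str (decode).
                    if isinstance(candidate, str):
                        candidate = candidate.encode(enc)
                    else:
                        candidate = candidate.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if PLAUSIBLE.match(candidate):
                return candidate
    return None


# Build the mangled samples programmatically instead of pasting them.
once = 'Zемфира'.encode('utf-8').decode('cp1252')
twice = 'Zемфира'.encode('cp1251').decode('cp1252').encode('utf-8').decode('cp1252')
for sample in (once, twice):
    print(guess_decode(sample))
```

The search stops at the first chain whose result passes the pattern, so the character class is what decides the outcome; text with characters outside it (punctuation, other alphabets) would need a wider pattern.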

Thank you. With this help, I just wrote a basic plugin for Picard mp3 tagger to decode mangled cyrillic tags. https://github.com/Aeon/picard-plugins/blob/master/plugins/decode_cyrillic/decode_cyrillic.py — Aeon, Nov 20 '15 at 00:20
