
I am having an issue with character encoding with Mutagen.

I cast dict[key] to Unicode, but all I get are errors. The character in question is U+00E9 (é), but what gets printed is ├⌐. I am assuming the default character set for Mutagen is UTF-8, but is there a way to fix this?

Output:

Winter Wonderland.mp3
Album       : Christmas
Album Artist: Michael Bublé
Artist      : Michael Bublé
Composer    : None
Disk        : None
Encoded By  : None
Genre       : Christmas
Title       : Winter Wonderland
Track       : 17/19
Year        : 2011

Code:

#!/usr/bin/env python

import os
import re
from mutagen.mp3 import MP3

first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')
def convertCamelCase2Underscore(name):
    s1 = first_cap_re.sub(r'\1_\2', name)
    return all_cap_re.sub(r'\1_\2', s1).lower()

def convertCamelCase2CapitalizedWords(name):
    return ' '.join([x.capitalize() for x in convertCamelCase2Underscore(name).split('_')])

def safeValue(dict, key):
    return None if key not in dict else dict[key]

class Track:
    def __init__(self, path):
        audio = MP3(path)
        self.title = safeValue(audio, 'TIT2')
        self.artist = safeValue(audio, 'TPE1')
        self.albumArtist = safeValue(audio, 'TPE2')
        self.album = safeValue(audio, 'TALB')
        self.genre = safeValue(audio, 'TCON')
        self.year = safeValue(audio, 'TDRL')
        self.encodedBy = safeValue(audio, 'TENC')
        self.composer = safeValue(audio, 'TXXX:TCM')
        self.track = safeValue(audio, 'TRCK')
        self.disk = safeValue(audio, 'TXXX:TPA')
    def __repr__(self):
        ret = ''
        fields = self.__dict__

        for k, v in sorted(self.__dict__.iteritems()):
            ret += '{:12s}: {:s}\n'.format(convertCamelCase2CapitalizedWords(k), v)
        return ret

files = os.listdir('.')

for filename in files:
    print filename
    print Track(filename)
Mr. Polywhirl
  • Textual information in ID3v2 tags can be encoded in a [number of different ways](http://en.wikipedia.org/wiki/Id3v2#ID3v2). I see nothing in [Mutagen's latest documentation](http://mutagen.readthedocs.org/en/latest/) that specifies any kind of default character set, so it's possible it's just returning raw tag data -- although the project page says it supports Unicode. If all else fails, you could take a look at the source code, since it's open source. – martineau Dec 06 '13 at 05:02

1 Answer


> I am assuming the default character set for Mutagen is UTF-8

Mutagen returns Unicode strings, though wrapped in a TextFrame object. When you print that object, Python implicitly calls str() on it, which converts the text property to bytes, and Mutagen (arbitrarily) chooses UTF-8 for that encoding.

Unfortunately the Windows console doesn't support UTF-8[1]. The encoding it uses varies but in your case you are getting the US DOS code page 437 where the byte sequence 0xC3 0xA9 represents ├⌐ and not é. You could try to print to the console in the encoding that it wants by explicitly encoding to it:

import sys
print unicode(audio['TIT2']).encode(sys.stdout.encoding)  # 'cp437'

but this will still only allow you to print characters that are supported in that code page. 437 is OK for Michael Bublé, but not so good for 東京事変. There isn't a good way to get Unicode out to the Windows console.[2]
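To see both effects without a Windows console at hand, the byte-level behaviour can be reproduced with Python's built-in codecs. This is only a minimal sketch: the helper names `as_seen_on_console` and `encode_for_console` are made up for illustration, and `"replace"` is just one possible error handler (it substitutes `?` for anything the code page lacks).

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def as_seen_on_console(text, written_as="utf-8", console="cp437"):
    """Return what a console using `console` displays for bytes
    that were actually encoded as `written_as`."""
    return text.encode(written_as).decode(console)


def encode_for_console(text, console="cp437"):
    """Encode text for the console's code page, substituting '?'
    for characters the code page cannot represent."""
    return text.encode(console, "replace")


# U+00E9 goes out as the UTF-8 bytes 0xC3 0xA9, which cp437 shows as
# U+251C U+2310, the two box-drawing/symbol characters from the question.
garbled = as_seen_on_console("Bubl\u00e9")
assert garbled == "Bubl\u251c\u2310"

# cp437 does contain U+00E9 (as byte 0x82), so explicit encoding works here.
assert encode_for_console("Bubl\u00e9") == b"Bubl\x82"

# cp437 has no CJK characters, so all four of these are replaced.
assert encode_for_console("\u6771\u4eac\u4e8b\u5909") == b"????"
```

Without the `"replace"` argument, the last call would raise `UnicodeEncodeError` instead, which is what the plain `.encode(sys.stdout.encoding)` line does for text outside the code page.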

[1] There is code page 65001 which is supposed to be UTF-8, but there are bugs in the MS implementation which usually make it unusable.

[2] You can, if you must, call the Win32 API WriteConsoleW directly using ctypes, but then you have to take care to only do that when you are connected to a Windows console and not any other type of stream so you don't break everywhere else. It's usually not worth it; Windows users are assumed to be used to a console where non-ASCII characters just break all the time.
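A rough sketch of what footnote [2] describes might look like the following. The function name `write_console_w` is hypothetical, the code assumes `text` is a Unicode string, and using `GetConsoleMode` as the "is this really a console?" check is one common approach rather than the only one. It has not been hardened for production use.

```python
import ctypes
import sys

STD_OUTPUT_HANDLE = -11  # Win32 constant identifying standard output


def write_console_w(text):
    """Write Unicode text via WriteConsoleW.

    Returns False (writing nothing) when not on Windows or when stdout
    is redirected to a file or pipe, so the caller can fall back to the
    normal print path.
    """
    if sys.platform != "win32":
        return False
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.GetConsoleMode.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE & 0xFFFFFFFF)
    mode = wintypes.DWORD()
    # GetConsoleMode fails for handles that are not real consoles.
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False

    sys.stdout.flush()  # keep ordering with anything already buffered
    units = len(text.encode("utf-16-le")) // 2  # length in UTF-16 code units
    written = wintypes.DWORD()
    return bool(kernel32.WriteConsoleW(
        handle, text, units, ctypes.byref(written), None))
```

A caller would try `write_console_w(...)` first and, when it returns False, fall back to the explicit `.encode(sys.stdout.encoding)` approach shown earlier in this answer.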

bobince
  • So do all these big-wig music players out there write proprietary code to read and write on a per-OS system? – Mr. Polywhirl Dec 08 '13 at 15:44
  • The problem is not with reading the files. (Well, there are problems there too, but that's a different story.) The problem you are having is purely to do with printing to the Windows console. It's well known that the standard Windows console sucks and most software/languages can't print Unicode to it reliably. – bobince Dec 08 '13 at 16:59