27

My troubles with ConfigParser continue. It seems it doesn't support Unicode very well. The config file is indeed saved as UTF-8, but when ConfigParser reads it, it seems to be encoded into something else. I assumed it was latin-1 and I thought overriding optionxform could help:

-- configfile.cfg -- 
[rules]
Häjsan = 3
☃ = my snowman

-- myapp.py --
# -*- coding: utf-8 -*-  
import ConfigParser

def _optionxform(s):
    try:
        newstr = s.decode('latin-1')
        newstr = newstr.encode('utf-8')
        return newstr
    except Exception, e:
        print e

cfg = ConfigParser.ConfigParser()
cfg.optionxform = _optionxform    
cfg.read("myconfig") 

Of course, when I read the config I get:

'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I've tried a couple of different variations of decoding 's', but the point seems moot, since it really should be a unicode object from the beginning. After all, the config file is UTF-8. I have confirmed that something is wrong in the way ConfigParser reads the file by stubbing it out with this DummyConfig class. If I use that, then everything is nice unicode, fine and dandy.

-- config.py --
# -*- coding: utf-8 -*-                
apa = {'rules': [(u'Häjsan', 3), (u'☃', u'my snowman')]}

class DummyConfig(object):
    def sections(self):
        return apa.keys()
    def items(self, section):
        return apa[section]
    def add_section(self, apa):
        pass  
    def set(self, *args):
        pass  

Any ideas about what could be causing this, or suggestions of other config modules that support Unicode better, are most welcome. I don't want to use sys.setdefaultencoding()!

pojo

5 Answers

22

The ConfigParser.readfp() method can take a file object. Have you tried opening the file with the correct encoding using the codecs module before passing it to ConfigParser, like below?

cfg.readfp(codecs.open("myconfig", "r", "utf8"))

For Python 3.2 or above, readfp() is deprecated. Use read_file() instead.
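
For Python 3, the same idea might be sketched with the built-in open() and read_file(). This is only a minimal sketch, assuming Python 3.2 or later; the load_config helper and the temporary file are hypothetical names used for illustration, and optionxform is set to str only because the default would lowercase option names such as Häjsan.

```python
import configparser
import os
import tempfile


def load_config(path, encoding="utf-8"):
    """Read an INI file through a file object opened with an explicit encoding."""
    cfg = configparser.ConfigParser()
    cfg.optionxform = str  # keep option names as written (the default lowercases them)
    with open(path, "r", encoding=encoding) as fp:
        cfg.read_file(fp)
    return cfg


# Illustration only: write the question's sample config to a temporary file.
sample = "[rules]\nHäjsan = 3\n☃ = my snowman\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".cfg", encoding="utf-8", delete=False
) as tmp:
    tmp.write(sample)

cfg = load_config(tmp.name)
items = cfg.items("rules")
os.remove(tmp.name)
print(items)
```

The decoding is done by the file object, so the parser itself only ever sees text.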

Christina
Tendayi Mawushe
  • I had the same issue and solved it the same way to read from the config file. But I also need to write back a modified version of it, and that fails even if I use codecs.open: `with codecs.open(filename, encoding=ENCODING, mode='wb') as conffile: config.write(conffile)` – Ghislain Leveque Apr 04 '11 at 16:16
  • Hi Ghislain, I had the same issue with configparser when writing back Unicode strings. It was solved by updating it to the latest version with pip. – Erxin May 14 '13 at 06:15
  • This solution works perfectly for me: `import configparser` `config = configparser.ConfigParser()` `config.read('settings.ini', 'UTF-8')` – Dominik Dec 15 '20 at 08:50
16

In Python 3.2, an encoding parameter was introduced for read(), so it can now be used as:

cfg.read("myconfig", encoding='utf-8')
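
A slightly fuller sketch of the same call, assuming Python 3.2 or later. The temporary directory and the missing.cfg name are hypothetical and only there to make the example self-contained. It also shows two behaviours worth knowing: read() silently skips files it cannot open and returns the list it actually parsed, and the default optionxform lowercases option names.

```python
import configparser
import os
import tempfile

# Illustration only: a temporary directory stands in for the real config location.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "myconfig")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("[rules]\nHäjsan = 3\n☃ = my snowman\n")

    cfg = configparser.ConfigParser()
    # read() skips unreadable files and returns the ones it parsed,
    # so checking the return value catches a wrong path early.
    parsed = cfg.read(
        [path, os.path.join(tmpdir, "missing.cfg")], encoding="utf-8"
    )

snowman = cfg.get("rules", "☃")
greeting = cfg.get("rules", "häjsan")  # default optionxform lowercased "Häjsan"
print(parsed, snowman, greeting)
```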
Krzysztof Słowiński
2

Try overriding the write method of RawConfigParser like this:

# Python 2 only: relies on the `unicode` type and on RawConfigParser internals.
from ConfigParser import RawConfigParser


class ConfigWithCoder(RawConfigParser):
    def write(self, fp):
        """Write an .ini-format representation of the configuration state."""
        if self._defaults:
            fp.write("[%s]\n" % "DEFAULT")
            for (key, value) in self._defaults.items():
                fp.write("%s = %s\n" % (key, str(value).replace('\n', '\n\t')))
            fp.write("\n")
        for section in self._sections:
            fp.write("[%s]\n" % section)
            for (key, value) in self._sections[section].items():
                if key == "__name__":
                    continue
                if (value is not None) or (self._optcre == self.OPTCRE):
                    # Encode unicode values explicitly instead of letting str()
                    # fall back to the ASCII codec.
                    if isinstance(value, unicode):
                        value = value.encode('utf-8')
                    else:
                        value = str(value)
                    value = value.replace('\n', '\n\t')
                    key = " = ".join((key, value))
                fp.write("%s\n" % (key))
            fp.write("\n")
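
On Python 3 an override like this should not be needed, since option names and values are already text and the file object does the encoding. A minimal sketch, assuming Python 3; the temporary file is a hypothetical stand-in for the real config path:

```python
import configparser
import os
import tempfile

cfg = configparser.ConfigParser()
cfg.optionxform = str  # preserve the case of option names
cfg["rules"] = {"Häjsan": "3", "☃": "my snowman"}

# Illustration only: a temporary file stands in for the real config path.
fd, path = tempfile.mkstemp(suffix=".cfg")
os.close(fd)

# The encoding is applied by the file object; write() only hands it text.
with open(path, "w", encoding="utf-8") as fp:
    cfg.write(fp)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, "rb") as fp:
    raw = fp.read()
os.remove(path)

text = raw.decode("utf-8")
print(text)
```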
user1438038
LI ZHE
1

This seems to be a problem with the Python 2.x version of ConfigParser; the Python 3.x version is free of it. In this issue of the Python Bug Tracker, the status is Closed + WONTFIX.

I fixed it by editing the ConfigParser.py file. In the write method (around line 412), change:

key = " = ".join((key, str(value).replace('\n', '\n\t')))

to

key = " = ".join((key, str(value).decode('utf-8').replace('\n', '\n\t')))

I don't know if it's a real solution, but I tested it on Windows 7 and Ubuntu 15.04 and it works like a charm: I can share and work with the same .ini file on both systems.

neogurb
  • I ran into this today, still chewing this over but at first sight it seems to me the hard cast str() for doing the replace is unwarranted and unneeded (and therefore itself a bug) in ConfigParser.py. My reasoning being that if the "value" being transformed is a normal Python2 string, then replace will work correctly without the str() cast, while if it is a unicode string then forcing it to str() implies encoding the str buffer with the default "ascii" encoder which is impossible if the string contains Unicode characters. Moreover unicode string also implements .replace(). So, why the str()? – W.Prins Nov 16 '18 at 09:30
  • Also, as of this writing your proposed solution doesn't seem to work: `>>> key = u'foo'` `>>> value = u'd\xeb\x02\nvK+'` `>>> key = " = ".join((key, str(value).decode('utf-8').replace('\n', '\n\t')))` raises `UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 1: ordinal not in range(128)` – W.Prins Nov 16 '18 at 10:05
  • Finally I wonder at what ConfigParser is doing here -- inserting tabs after newlines? Is that legit? Seems to be taking liberties with the "value" which it arguably may not? – W.Prins Nov 16 '18 at 10:07
  • And to note, removing the str().decode() calls does seem to work: `>>> key = u'foo'` `>>> value = u'd\xeb\x02\nvK+'` `>>> key = " = ".join((key, value.replace('\n', '\n\t')))` `>>> print repr(key)` gives `u'foo = d\xeb\x02\n\tvK+'` – W.Prins Nov 16 '18 at 10:09
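
On the question of the tab after each newline: in the INI dialect that configparser reads, a line indented deeper than the option it follows is treated as a continuation of that option's value, which is presumably why write() indents embedded newlines. A small round-trip sketch, assuming Python 3's configparser (the option name greeting is made up for illustration):

```python
import configparser
import io

cfg = configparser.ConfigParser()
cfg["rules"] = {"greeting": "Häjsan\nvärlden"}  # a value with an embedded newline

# Write to an in-memory text buffer instead of a real file.
buf = io.StringIO()
cfg.write(buf)
text = buf.getvalue()

# Parse the written text again; the indented line is read as a continuation.
reloaded = configparser.ConfigParser()
reloaded.read_string(text)
value = reloaded["rules"]["greeting"]
print(repr(text))
print(repr(value))
```

Without the indentation, the second line would be parsed as a separate option rather than as part of the value.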
-2

What I did was just:

file_name = file_name.decode("utf-8")
cfg.read(file_name)
president