I'm working with YAML files that have to be human-readable and editable, but that will also be edited from Python code. I'm using Python 2.7.3.

The files need to handle accented characters (mostly for text in French).

Here is a sample of my issue:

import codecs
import yaml

file = r'toto.txt'

f = codecs.open(file,"w",encoding="utf-8")

text = u'héhéhé, hûhûhû'

textDict = {"data": text}

f.write( 'write unicode     : ' + text + '\n' )
f.write( 'write dict        : ' + unicode(textDict) + '\n' )
f.write( 'yaml dump unicode : ' + yaml.dump(text))
f.write( 'yaml dump dict    : ' + yaml.dump(textDict))
f.write( 'yaml safe unicode : ' + yaml.safe_dump(text))
f.write( 'yaml safe dict    : ' + yaml.safe_dump(textDict))

f.close()

The written file contains:

write unicode     : héhéhé, hûhûhû
write dict        : {'data': u'h\xe9h\xe9h\xe9, h\xfbh\xfbh\xfb\n'}

yaml dump unicode : "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"
yaml dump dict    : {data: "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"}

yaml safe unicode : "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"
yaml safe dict    : {data: "h\xE9h\xE9h\xE9, h\xFBh\xFBh\xFB"}

The YAML dump loads back perfectly with yaml, but it is not human-readable.

As you can see in the example code, the result is the same when I write the unicode representation of a dict (I don't know whether that is related).

I'd like the dump to contain the accented text itself, not the Unicode escape codes. Is that possible?

Anthon
Hans Baldzuhn
  • This is Python **2** I suppose? I'm not too firm in Python 2 Unicode handling, but you may want to try `yaml.safe_dump` instead, which dumps data in implementation-neutral format instead of Python-specific format. – deceze Mar 30 '15 at 09:40
  • Oh yeah sorry, it's python 2.7.3, and using safe_dump has the exact same output. – Hans Baldzuhn Mar 30 '15 at 09:58

2 Answers

yaml can dump unicode characters as-is if you pass the allow_unicode=True keyword argument to any of the dumpers. If you don't provide a file, dump() returns a UTF-8 encoded byte string (i.e. the result of getvalue() on the StringIO() instance that is created to hold the dumped data), and you have to decode that to unicode before appending it to your string:

# coding: utf-8

import codecs
import ruamel.yaml as yaml

file_name = r'toto.txt'

text = u'héhéhé, hûhûhû'

textDict = {"data": text}

with open(file_name, 'w') as fp:
    yaml.dump(textDict, stream=fp, allow_unicode=True)

print('yaml dump dict 1   : ' + open(file_name).read()),

f = codecs.open(file_name,"w",encoding="utf-8")
f.write('yaml dump dict 2   : ' + yaml.dump(textDict, allow_unicode=True).decode('utf-8'))
f.close()
print(open(file_name).read())

output:

yaml dump dict 1    : {data: 'héhéhé, hûhûhû'}
yaml dump dict 2    : {data: 'héhéhé, hûhûhû'}

I tested this with my enhanced version of PyYAML (ruamel.yaml), but this should work the same in PyYAML itself.
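
For readers on Python 3, the effect of allow_unicode can be checked with a small sketch like the one below. It assumes Python 3 with PyYAML installed (not the Python 2.7 setup used above); the variable names `escaped` and `readable` are only illustrative.

```python
import yaml

data = {"data": "héhéhé, hûhûhû"}

# Default behaviour: non-ASCII characters are written as \xNN escapes
# inside a double-quoted scalar.
escaped = yaml.safe_dump(data)

# With allow_unicode=True the accented characters are kept as they are.
readable = yaml.safe_dump(data, allow_unicode=True)

# Both strings are valid YAML and load back to the same Python object.
assert yaml.safe_load(escaped) == data
assert yaml.safe_load(readable) == data
```

On Python 3, safe_dump() without a stream returns a str, so no decode step is needed before concatenating it with other text.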

Anthon
  • Thanks a lot! This works perfectly. I tried the allow_unicode argument but without success (I was missing the decode part). – Hans Baldzuhn Apr 14 '15 at 08:25
  • Dear [Anthon](https://stackoverflow.com/users/1307905/anthon), I don't know why, but this solution gives me the following error: "UnicodeEncodeError: 'charmap' codec can't encode character...". I use Windows 10 Eng + Python 3.6 – ragesz Jun 20 '18 at 12:49
  • @ragesz Python 3 already supports Unicode; if you use that, don't use `codecs.open`. – Anthon Jun 23 '18 at 04:42
  • The first method works for me, but it also leaves `!!python/unicode` identifiers in the YAML when no non-ASCII character is present in the text. Is there a way to get rid of those? – Pygmalion Nov 08 '19 at 10:55
  • @Pygmalion I am not sure what you are trying to achieve, with which code, which Python version and on which platform. The output is from actually running the code (on Python 2.7), so you must be doing something different. Please post a full question. – Anthon Nov 08 '19 at 15:32

Update (2020)

Nowadays, PyYAML handles Unicode easily under Python 3, but writing the characters unescaped still requires the allow_unicode=True argument:

import yaml
d = {'a': 'héhéhé', 'b': 'hühühü'}
yaml_code = yaml.dump(d, allow_unicode=True, sort_keys=False)
print(yaml_code)

Will result in:

a: héhéhé
b: hühühü
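
To write the YAML to a file rather than print it, a minimal sketch could look like the following. It assumes Python 3 and PyYAML; the helper names `write_yaml_utf8` and `read_yaml_utf8` and the temporary path are made up for illustration. Passing encoding="utf-8" to open() avoids relying on the platform's default encoding, which is a plausible cause of "charmap" errors on Windows.

```python
import os
import tempfile

import yaml


def write_yaml_utf8(data, path):
    """Dump data to path as UTF-8 YAML with unescaped accents."""
    # Explicit encoding: do not depend on the locale's default codec.
    with open(path, "w", encoding="utf-8") as fp:
        yaml.safe_dump(data, fp, allow_unicode=True)


def read_yaml_utf8(path):
    """Load YAML from a UTF-8 encoded file."""
    with open(path, encoding="utf-8") as fp:
        return yaml.safe_load(fp)


data = {"data": "héhéhé, hûhûhû"}
path = os.path.join(tempfile.mkdtemp(), "toto.yaml")  # hypothetical file name

write_yaml_utf8(data, path)
restored = read_yaml_utf8(path)
assert restored == data
```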

Note: The sort_keys=False argument (available since PyYAML 5.1) should be used as of Python 3.6 to keep the dictionary keys in their original order. PyYAML has traditionally sorted keys, because Python dictionaries did not have a defined order. Even though dictionaries have preserved insertion order since Python 3.6 (officially since 3.7), PyYAML still sorts keys by default.
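
The difference can be seen with keys inserted out of alphabetical order. This is a small sketch assuming Python 3.7+ and PyYAML 5.1 or later; the names `sorted_yaml` and `ordered_yaml` are illustrative.

```python
import yaml

# Keys deliberately inserted in non-alphabetical order.
d = {"b": "hühühü", "a": "héhéhé"}

# Default: PyYAML sorts the keys, so "a" is written first.
sorted_yaml = yaml.dump(d, allow_unicode=True)

# sort_keys=False: insertion order is kept, so "b" is written first.
ordered_yaml = yaml.dump(d, allow_unicode=True, sort_keys=False)

assert sorted_yaml.startswith("a:")
assert ordered_yaml.startswith("b:")
```

Either way the loaded data compares equal, since dict equality ignores order; sort_keys only matters for how the file reads to a human.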

fralau