I'm working on loading a list of emoji characters in a simple Python 3.6 script. The YAML structure is essentially as follows:

- 🙂
- 😁
- 😬

My python script looks like this:

import yaml
f = open('emojis.yml')
EMOJIS = yaml.load(f)
f.close()

I'm getting the following exception:

yaml.reader.ReaderError: unacceptable character #x001d: special characters are not allowed in "emojis.yml", position 2

I have seen the allow_unicode=True option, but that seems to be available only for yaml.dump. It appears that people have had trouble with similar issues in Python 2, but since all strings in Python 3 are Unicode, I'm having trouble figuring out why this isn't working.

I've also tried wrapping my emojis in quotes and using a custom constructor for 'tag:yaml.org,2002:str'. My custom constructor is never even hit, presumably because the yaml lib fails to recognize my emoji as having the string type. I also observe the same behavior when I define my emoji directly as a string in source.

Is there a way to load a yaml file containing emojis with PyYAML?

Quinn Stearns
  • I don't think PyYAML supports the SMP at all. – Ignacio Vazquez-Abrams Jul 02 '17 at 21:36
  • @IgnacioVazquez-Abrams, I'm sorry, I'm no Unicode expert. By SMP, do you mean the Supplementary Multilingual Plane? Is the SMP where emoji support is defined? – Quinn Stearns Jul 02 '17 at 21:42
  • @QuinnStearns SMP is the [supplementary Unicode plane 1](https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Overview) and that plane includes those [emoticons](https://en.wikipedia.org/wiki/Emoticons_%28Unicode_block%29). PyYAML considers those unprintable based on an easy to modify test. The main development of PyYAML stopped long before the emoticons were introduced in 2010 (i.e. in Unicode 6.0 and later), also the reason PyYAML doesn't support the latest YAML 1.2 standard (2009). A simple workaround is to redefine the printable unicode char matching rule. – Anthon Jul 03 '17 at 06:00

2 Answers

You should upgrade to ruamel.yaml (disclaimer: I am the author of that package), which has this and many other long-standing PyYAML issues fixed:

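import sys
from ruamel.yaml import YAML

yaml = YAML()

# Print the code point of every character in the file, four per line,
# to confirm that each emoji is stored as one supplementary-plane character.
with open('emojis.yml', encoding='utf-8') as fp:
    for idx, c in enumerate(fp.read(), start=1):
        print('{:08x}'.format(ord(c)), end=' ')
        if idx % 4 == 0:
            print()

# Load the file, then dump the resulting data structure again.
with open('emojis.yml', encoding='utf-8') as fp:
    data = yaml.load(fp)
yaml.dump(data, sys.stdout)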

gives:

0000002d 00000020 0001f642 0000000a 
0000002d 00000020 0001f601 0000000a 
0000002d 00000020 0001f62c 0000000a 
['🙂', '😁', '😬']

If you really have to stick with PyYAML, you can do:

import yaml.reader
import re

yaml.reader.Reader.NON_PRINTABLE = re.compile(
    u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')

to get rid of the error.
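
To see why the replacement pattern helps, the two character classes can be compared using only the standard library. This is a sketch under assumptions: `OLD_NON_PRINTABLE` is my guess at the default used by PyYAML releases before 5 (the pattern above without the `\U00010000-\U0010FFFF` range), and both variable names are made up for illustration.

```python
import re

# Assumed default of older PyYAML releases: nothing above U+FFFD is accepted.
OLD_NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

# The pattern from the workaround above, which also accepts U+10000..U+10FFFF.
NEW_NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')

line = '- \U0001f642\n'  # first line of emojis.yml

old_match = OLD_NON_PRINTABLE.search(line)
new_match = NEW_NON_PRINTABLE.search(line)

print(old_match.start())  # index of the first character the old pattern rejects
print(new_match)          # None when the extended pattern rejects nothing
```

With the old pattern the first rejected character is the emoji at index 2, which lines up with the "position 2" reported in the question's traceback.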


Starting with version 0.15.16, ruamel.yaml now also dumps all supplementary plane Unicode without reverting to \Uxxxxxxxx (controllable in the new API via .unicode_supplementary, and depending on allow_unicode).

Anthon

Update

The latest version of PyYAML has fixed this bug; upgrade to pyyaml>=5.


Original answer

This seems to be a bug in PyYAML. A workaround is to use YAML escape sequences inside double-quoted strings:

$ cat test.yaml
- "\U0001f642"
- "\U0001f601"
- "\U0001f62c"

$ python
...
>>> yaml.load(open('test.yaml'))
['🙂', '😁', '😬']
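
If a file already contains literal emoji, the escaping could be automated. The sketch below uses only the standard library; `escape_non_bmp` is a hypothetical helper name, and it assumes every affected character sits inside a double-quoted YAML scalar, since `\U` escapes are only interpreted there.

```python
import re

# Any code point outside the Basic Multilingual Plane.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def escape_non_bmp(text):
    """Replace each supplementary-plane character with a \\UXXXXXXXX escape."""
    return NON_BMP.sub(lambda m: '\\U{:08x}'.format(ord(m.group())), text)


# Same content as test.yaml, but with the emoji written literally.
source = '- "\U0001f642"\n- "\U0001f601"\n- "\U0001f62c"\n'
escaped = escape_non_bmp(source)
print(escaped, end='')
```

The escaped text is pure ASCII, so the reader's printable-character check never sees a supplementary-plane character.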
anthony sottile