16

I've been working with the PyYAML parser for a few months now to convert file types as part of a data pipeline. I've found the parser to be quite idiosyncratic at times, and today I've stumbled on another strange behavior. The file I'm currently converting contains the following section:

off:
    yes: "Flavor text for yes"
    no: "Flavor text for no"

I keep a list of the current nesting in the dictionary so that I can construct a flat document, but save the nesting to convert back to YAML later on. I got a TypeError saying I was trying to concatenate a str and a bool together. I investigated and found that PyYAML is actually taking the section above and converting it to the following:

with open(filename, "r") as f:
    data = yaml.load(f.read())
print(data)

>> {False: {True: "Flavor text for yes", False: "Flavor text for no"}}

I did a quick check and found that PyYAML was doing this for yes, no, true, false, on, off. It only does this conversion if the keys are unquoted. Quoted values and keys will be passed fine. Looking for solutions, I found this behavior documented here.
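
The quick check described above can be reproduced with a few lines. This is a minimal sketch, assuming PyYAML is installed; it uses yaml.safe_load, and the variable names are only illustrative:

```python
import yaml

spellings = ['yes', 'no', 'true', 'false', 'on', 'off']

# Unquoted plain scalars: each one is resolved to a boolean
unquoted = yaml.safe_load('[' + ', '.join(spellings) + ']')

# The same words in double quotes stay strings
quoted = yaml.safe_load('[' + ', '.join('"%s"' % s for s in spellings) + ']')

print(unquoted)
print(quoted)
```

The first list comes back as booleans, the second as the original strings, which matches the observation that only unquoted scalars are converted.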

Although it might be helpful to others to know that quoting the keys will stop PyYAML from doing this, I don't have this option as I am not the author of these files and have written my code to touch the data as little as possible.

Is there a workaround for this issue or a way to override the default conversion behavior in PyYAML?

Anthon
sulimmesh

5 Answers

17

PyYAML is YAML 1.1 conformant for parsing and emitting, and for YAML 1.1 this is at least partly documented behavior, so no idiosyncrasy at all, but conscious design.

In YAML 1.2 (which in 2009 superseded the 1.1 specification from 2005) this usage of Off/On/Yes/No was dropped, among other changes.

In ruamel.yaml (disclaimer: I am the author of that package), the round_trip_loader is a safe_loader that defaults to YAML 1.2 behaviour:

import ruamel.yaml as yaml

yaml_str = """\
off:
    yes: "Flavor text for yes"  # quotes around value dropped
    no: "Flavor text for no"
"""

data = yaml.round_trip_load(yaml_str)
assert 'off' in data
print(yaml.round_trip_dump(data, indent=4))

Which gives:

off:
    yes: Flavor text for yes    # quotes around value dropped
    no: Flavor text for no

If your output needs to be version 1.1 compatible then you can dump with an explicit version=(1, 1).

Since the quotes around the nested mapping's scalar values are unnecessary they are not emitted on writing out.


If you need to do this with PyYAML, rewrite the (global) rules it uses for boolean recognition:

import yaml
from yaml.resolver import Resolver

yaml_str = """\
off:
    yes: "Flavor text for yes"  # quotes around value dropped
    no: "Flavor text for no"
"""

# remove resolver entries for On/Off/Yes/No
for ch in "OoYyNn":
    if len(Resolver.yaml_implicit_resolvers[ch]) == 1:
        del Resolver.yaml_implicit_resolvers[ch]
    else:
        Resolver.yaml_implicit_resolvers[ch] = [x for x in
                Resolver.yaml_implicit_resolvers[ch] if x[0] != 'tag:yaml.org,2002:bool']

data = yaml.load(yaml_str, Loader=yaml.Loader)  # PyYAML 6 requires an explicit Loader
print(data)
assert 'off' in data
print(yaml.dump(data))

Which gives:

{'off': {'yes': 'Flavor text for yes', 'no': 'Flavor text for no'}}
off: {no: Flavor text for no, yes: Flavor text for yes}

This works because PyYAML keeps a global dict (Resolver.yaml_implicit_resolvers) which maps first letters to a list of (tag, re.match_pattern) values. For o, O, y and Y there is only one such pattern (and it can be deleted), but for n/N you can also match null/Null, so you have to delete the right pattern.

After that removal yes, no, on, Off are no longer recognised as bool, but True and False still are.
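
To make the first-letter lookup concrete, here is a toy model of it in plain Python. It is not PyYAML's code: the table, the simplified null pattern, and the helper names (build_table, resolve, drop_yes_no_on_off) are assumptions made for illustration, and only the bool and null entries are modelled.

```python
import re

BOOL_TAG = 'tag:yaml.org,2002:bool'
NULL_TAG = 'tag:yaml.org,2002:null'

# Simplified stand-ins for the YAML 1.1 recognition patterns
BOOL_RE = re.compile(
    r'^(?:yes|Yes|YES|no|No|NO|true|True|TRUE|false|False|FALSE'
    r'|on|On|ON|off|Off|OFF)$')
NULL_RE = re.compile(r'^(?:~|null|Null|NULL|)$')


def build_table():
    """Map each possible first character to a list of (tag, pattern)."""
    table = {}
    for ch in 'yYnNtTfFoO':
        table.setdefault(ch, []).append((BOOL_TAG, BOOL_RE))
    for ch in '~nN':
        table.setdefault(ch, []).append((NULL_TAG, NULL_RE))
    return table


def resolve(table, text):
    """Return the first tag whose pattern matches, or None for a plain string."""
    for tag, pattern in table.get(text[:1], []):
        if pattern.match(text):
            return tag
    return None


def drop_yes_no_on_off(table):
    """Same idea as the loop above: remove the bool entry under O/o/Y/y/N/n."""
    for ch in 'OoYyNn':
        remaining = [entry for entry in table[ch] if entry[0] != BOOL_TAG]
        if remaining:
            table[ch] = remaining
        else:
            del table[ch]


words = ('yes', 'Off', 'null', 'true')
table = build_table()
before = {word: resolve(table, word) for word in words}
drop_yes_no_on_off(table)
after = {word: resolve(table, word) for word in words}
print(before)
print(after)
```

In this model 'yes' and 'Off' stop resolving to the bool tag after the removal, 'null' still resolves to the null tag because its entry under n is kept, and 'true' is untouched because t is never visited.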

Anthon
  • I've taken a look at this package before and I do think for the long term this might be what I go with. Keeping comments and special keys is also useful. – sulimmesh Apr 07 '16 at 16:53
  • @skeletalbassman Did you actually get this to work with Ricardo's answer? I more or less stopped reading that once I hit his nonsense about "data conversion by the Constructor class" and he certainly doesn't take the easiest route when doing this in PyYAML (which is changing the pattern for boolean recognition). – Anthon Apr 07 '16 at 18:49
  • Yeah it does work. I needed a quick solution since I'm launching in a week. There are reasons to convert to a package that supports comments, but that would be a project in itself to change our dependencies. – sulimmesh Apr 07 '16 at 18:55
  • @skeletalbassman I updated my answer: with just 6 lines you can prevent PyYAML from recognising On/Off/Yes/No as boolean without affecting True and False being booleans. Quicker and (subjectively) cleaner. – Anthon Apr 07 '16 at 19:12
  • @Anthon. Sorry if what I said sounds like "nonsense" to you. I just had a cursory look at the way PyYAML deals with data and yeah, actually `SafeConstructor` takes a node tagged (by the resolver) as boolean and uses a dict mapping 'yes', 'no', 'true', 'false', 'on', 'off' to `True`/`False` to return the matching boolean value, so I didn't even bother looking further. I tested the proposed solution. It worked. I posted it. It may not be the ideal place to change the behaviour. It's kludgy (and I advertised as that). But hey, it works for a quick job. – Ricardo Cárdenes Apr 08 '16 at 03:07
  • @RicardoCárdenes I specifically indicated what sounded like nonsense to me (the Constructor doesn't change data types), not that everything you wrote did. And given the little context in the small example the OP gave, I couldn't imagine it would work correctly for him. PyYAML sucks in that it sets its recognition patterns on import (in `resolver.py:167`), so they are difficult to change. But that is IMO the ideal place to attack this. – Anthon Apr 08 '16 at 06:21
  • I see. What I meant with "data type conversion" is "mapping raw data to Python's native types" (ie. int, float, boolean, etc.) My bad, I wrote something funny. – Ricardo Cárdenes Apr 08 '16 at 08:15
7

yaml.load takes a second argument, a loader class (by default, yaml.loader.Loader). The predefined loader is a mash up of a number of others:

class Loader(Reader, Scanner, Parser, Composer, Constructor, Resolver):

    def __init__(self, stream):
        Reader.__init__(self, stream)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        Constructor.__init__(self)
        Resolver.__init__(self)

The Constructor class is the one mapping the data types to Python. One (kludgy, but fast) way to override the boolean conversion could be:

from yaml.constructor import Constructor

def add_bool(self, node):
    return self.construct_scalar(node)

Constructor.add_constructor(u'tag:yaml.org,2002:bool', add_bool)

which overrides the function that the constructor uses to turn boolean-tagged data into Python booleans. What we're doing here is just returning the string, verbatim.

This affects ALL YAML loading, though, because you're overriding the behaviour of the default constructor. A more proper way to do things could be to create a new class derived from Constructor, and a new Loader class that uses your custom constructor.
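
One way to keep the change local might look like the following. This is a sketch, assuming PyYAML is installed; it subclasses yaml.SafeLoader rather than assembling a loader by hand, and the names VerbatimBoolLoader and construct_bool_as_text are hypothetical.

```python
import yaml


class VerbatimBoolLoader(yaml.SafeLoader):
    """Hypothetical loader whose bool-tagged scalars stay strings."""


def construct_bool_as_text(loader, node):
    # Return the scalar exactly as written instead of True/False
    return loader.construct_scalar(node)


# Registered on the subclass, so the stock loaders keep their behaviour
VerbatimBoolLoader.add_constructor('tag:yaml.org,2002:bool',
                                   construct_bool_as_text)

doc = 'off:\n    yes: "Flavor text for yes"\n    no: "Flavor text for no"\n'
custom = yaml.load(doc, Loader=VerbatimBoolLoader)
default = yaml.load(doc, Loader=yaml.SafeLoader)
print(custom)
print(default)
```

Only loads that pass Loader=VerbatimBoolLoader see the string keys; the stock SafeLoader still returns booleans. Note that true and false also come back as strings with this constructor.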

Ricardo Cárdenes
2

Ran into this problem at work and had to implement it the "correct" way. Here are the steps that I took. Note, I am using the SafeLoader, not the regular Loader. The steps would be VERY similar.

General steps are

  1. Create custom SafeConstructor
  2. Create custom SafeLoader that uses this custom SafeConstructor
  3. Call yaml's "load" function, passing in the custom SafeLoader we created with the custom SafeConstructor

MySafeConstructor.py

from yaml.constructor import SafeConstructor

# Create custom safe constructor class that inherits from SafeConstructor
class MySafeConstructor(SafeConstructor):

    # Create a new method to handle boolean logic
    def add_bool(self, node):
        return self.construct_scalar(node)

# Inject the above boolean logic into the custom constructor
MySafeConstructor.add_constructor('tag:yaml.org,2002:bool',
                                  MySafeConstructor.add_bool)

  2. I then create a brand new loader class using the same format as the rest of the loaders defined except we pass in our newly created custom Constructor. We are essentially just "adding" to this list.

MySafeLoader.py

from yaml.reader import *
from yaml.scanner import *
from yaml.parser import *
from yaml.composer import *
from MySafeConstructor import *
from yaml.resolver import *


class MySafeLoader(Reader, Scanner, Parser, Composer, MySafeConstructor, Resolver):

    def __init__(self, stream):
        Reader.__init__(self, stream)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        MySafeConstructor.__init__(self)
        Resolver.__init__(self)

  3. Finally, we will import the custom safe loader into the main.py or wherever you are doing your load (works in __init__() too)

main.py

# Mandatory imports
from yaml import load
from MySafeLoader import MySafeLoader

def main():

    filepath_to_yaml = "/home/your/filepath/here.yml"

    # Open the stream and load the yaml doc using the custom SafeLoader;
    # the with-block closes the file even if loading fails
    with open(filepath_to_yaml, 'r') as file_stream:
        yaml_as_dict = load(file_stream, MySafeLoader)

    # Print our result
    print(yaml_as_dict)

if __name__ == "__main__":
    main()

Now we can use either the standard loader or our custom loader modified for the boolean logic we want. If you want values other than the strings, you can try overriding the bool_values dict in the MySafeConstructor class; it is a class-level dict on SafeConstructor that holds the translation table.

constructor.py

    bool_values = {
        'yes':      True,
        'no':       False,
        'true':     True,
        'false':    False,
        'on':       True,
        'off':      False,
    }

Note: If you do this, you will not want to override the boolean logic, just override this dict.
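
If the goal is to keep true/false as booleans while leaving yes/no/on/off exactly as written, a small custom constructor is another option. The sketch below assumes PyYAML is installed and subclasses yaml.SafeLoader for brevity; TrueFalseOnlyLoader and construct_true_false_only are hypothetical names, not part of PyYAML.

```python
import yaml


class TrueFalseOnlyLoader(yaml.SafeLoader):
    """Hypothetical loader: only true/false (any case) become booleans."""


def construct_true_false_only(loader, node):
    text = loader.construct_scalar(node)
    lowered = text.lower()
    if lowered == 'true':
        return True
    if lowered == 'false':
        return False
    # yes/no/on/off keep their original spelling and case
    return text


TrueFalseOnlyLoader.add_constructor('tag:yaml.org,2002:bool',
                                    construct_true_false_only)

result = yaml.load('[On, off, Yes, no, true, False]',
                   Loader=TrueFalseOnlyLoader)
print(result)
```

Because the decision is made in the constructor, the original spelling (for example On versus on) is still available at that point and can be returned unchanged.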

wski
1

Simply sanitize your input:

import yaml


def sanitize_load(s):
    s = ' ' + s
    for w in "yes no Yes No Off off On on".split():
        s = s.replace(' ' + w + ':', ' "' + w + '":')
    return yaml.load(s[1:])

with open(filename) as f:
    data = sanitize_load(f.read())
print(data)

This is much better than blindly poking in the horrible depths of pyyaml. That package comes with two, almost but not quite identical, sources and is a maintenance nightmare.

Felix
  • Your advice alters not just keys but any use of ` on:` including things like `{ text: "Turn on: then tune out"}` – Jason S Feb 16 '17 at 20:44
  • @JasonS : Felix's implementation is wrong, but his idea is sound: Don't rely on the inner implementation of yaml because it could change at any moment without notice. Instead, alter the input in a compatible way that will give the good result. Now, _properly_ writing that filter is not going to be easy, _at all_. It would probably need a full and compliant Yaml parser. – Toni Homedes i Saun Sep 28 '18 at 13:31
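
The objection raised in these comments can be checked without PyYAML by running only the text substitution. In the sketch below, quote_bool_like_keys is a hypothetical name for the replace step of sanitize_load with the yaml.load call left out:

```python
def quote_bool_like_keys(s):
    # Same substitution as sanitize_load, minus the yaml.load call
    s = ' ' + s
    for w in "yes no Yes No Off off On on".split():
        s = s.replace(' ' + w + ':', ' "' + w + '":')
    return s[1:]


# Intended case: keys get quoted
key_case = quote_bool_like_keys('off:\n    yes: "Flavor text for yes"\n')

# Unintended case: text inside a quoted value is rewritten too
value_case = quote_bool_like_keys('{ text: "Turn on: then tune out"}')

# Missed case: a top-level key after the first line has no leading space
missed_case = quote_bool_like_keys('a: 1\nno: 2\n')

print(key_case)
print(value_case)
print(missed_case)
```

The second result contains nested double quotes and is no longer the same document, and the third is returned unchanged, so the `no` key would still be loaded as a boolean.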
1

Just for completeness, I combined the answers of @Anthon and @wski so that the result

  • still returns the values true and false as bools, and only skips bool conversion for on, off, yes, no (in any case variation)
  • does not interfere with the global yaml package, i.e. does not affect other code

Here's the module:

# strict_bool_yaml.py

import yaml
from yaml.loader import Reader, Scanner, Parser, Composer, SafeConstructor, Resolver


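class StrictBoolSafeResolver(Resolver):
    # Give this class its own copy of the table; otherwise the edits below
    # would modify the dict shared with yaml's global Resolver
    yaml_implicit_resolvers = {
        ch: list(entries)
        for ch, entries in Resolver.yaml_implicit_resolvers.items()
    }

# remove resolver entries for On/Off/Yes/No
for ch in "OoYyNn":
    if len(StrictBoolSafeResolver.yaml_implicit_resolvers[ch]) == 1:
        del StrictBoolSafeResolver.yaml_implicit_resolvers[ch]
    else:
        StrictBoolSafeResolver.yaml_implicit_resolvers[ch] = [x for x in
                StrictBoolSafeResolver.yaml_implicit_resolvers[ch] if x[0] != 'tag:yaml.org,2002:bool']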

class StrictBoolSafeLoader(Reader, Scanner, Parser, Composer, SafeConstructor, StrictBoolSafeResolver):
    def __init__(self, stream):
        Reader.__init__(self, stream)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        SafeConstructor.__init__(self)
        StrictBoolSafeResolver.__init__(self)

def load(stream):
    """ Parse stream using StrictBoolSafeLoader. """
    return yaml.load(stream, Loader=StrictBoolSafeLoader)

Now use its load function instead of the one from yaml wherever strict boolean parsing is desired:

import strict_bool_yaml

strict_bool_yaml.load("""
- On
- Off
- on
- off
- true
- True
- false
- yes
- no
""")

Results in:

['On', 'Off', 'on', 'off', True, True, False, 'yes', 'no']
Jeronimo