Serializing a RangeDict using YAML or JSON in Python

Question

I am using RangeDict to make a dictionary that contains ranges. When I use Pickle it is easily written to a file and later read.

import pickle
from rangedict import RangeDict

rngdct = RangeDict()
rngdct[(1, 9)] = \
    {"Type": "A", "Series": "1"}
rngdct[(10, 19)] = \
    {"Type": "B", "Series": "1"}

with open('rangedict.pickle', 'wb') as f:
    pickle.dump(rngdct, f)

However, I want to use YAML (or JSON if YAML won't work) instead of Pickle, since most people seem to hate it, and I want human-readable files that make sense to the people reading them.

Basically, changing the code to use yaml and opening the file in 'w' mode instead of 'wb' does the trick for the writing side, but when I read the file in another script, I get these errors:

File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/yaml/constructor.py", line 129, in construct_mapping
value = self.construct_object(value_node, deep=deep)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/yaml/constructor.py", line 61, in construct_object
"found unconstructable recursive node", node.start_mark)
yaml.constructor.ConstructorError: found unconstructable recursive node

I'm lost here. How can I serialize the rangedict object and read it back in its original form?

I think the `NSStock` was a typo, if not please add its definition to your example. — Anthon, Oct 03 '17 at 06:11
That's right! Sorry for that, I renamed the variables, but forgot this one. Thanks for the remark! @Anthon — Rene Knuvers, Oct 05 '17 at 07:03

Anthon · Accepted Answer · 2017-10-03T07:46:13.850

TL;DR; Skip to the bottom of this answer for working code

I am sure some people hate pickle; it certainly can give some headaches when refactoring code (when the classes of pickled objects move to different files). But the bigger problem is that pickle is insecure, just as YAML is in the way that you used it.

It is interesting to note that you cannot pickle to the more readable protocol level 0 (the default in Python 3.6 is protocol version 3), as:

pickle.dump(rngdct, f, protocol=0) will throw:

TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is because the RangeDict module/class is a bit minimalistic, which also shows (or rather doesn't) if you try to do:

print(rngdct)

which will just print {}
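The protocol 0 failure can be reproduced without RangeDict at all. The following is a minimal sketch using only the standard library; `SlotNode` is a hypothetical stand-in (not part of rangedict) for a class that defines `__slots__` but no `__getstate__`:

```python
import pickle


class SlotNode:
    # Hypothetical stand-in: __slots__ is defined, __getstate__ is not.
    __slots__ = ("r", "value")

    def __init__(self, r, value):
        self.r = r
        self.value = value


node = SlotNode((1, 9), {"Type": "A", "Series": "1"})

try:
    pickle.dumps(node, protocol=0)
    message = None
except TypeError as exc:
    # Protocols 0 and 1 use the older reduce path, which refuses such classes.
    message = str(exc)

print(message)
```

Protocol 2 and later know how to handle `__slots__`, which would explain why the default protocol worked in the question.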

You probably used the PyYAML dump() routine (and its corresponding, unsafe, load()). Although that can dump generic Python classes, you have to realise that it was implemented before, or at roughly the same time as, Python 3.0 (and Python 3 support was implemented later on). And although there is no reason a YAML parser could not dump and load the exact information that pickle does, PyYAML doesn't hook into the pickle support routines (although it could), and certainly not into the information for the Python 3 specific pickling protocols.

Anyway, without a specific representer (and constructor) for RangeDict objects, using YAML doesn't really make any sense: it makes loading potentially unsafe, and your YAML includes all of the gory details that make the object efficient. If you do yaml.dump(), you get:

!!python/object:rangedict.RangeDict
_root: &id001 !!python/object/new:rangedict.Node
  state: !!python/tuple
  - null
  - color: 0
    left: null
    parent: null
    r: !!python/tuple [1, 9]
    right: !!python/object/new:rangedict.Node
      state: !!python/tuple
      - null
      - color: 1
        left: null
        parent: *id001
        r: !!python/tuple [10, 19]
        right: null
        value: {Series: '1', Type: B}
    value: {Series: '1', Type: A}

IMO, a readable representation in YAML would instead be:

!rangedict
[1, 9]:
  Type: A
  Series: '1'
[10, 19]:
  Type: B
  Series: '1'

Because of the sequences used as keys, this cannot be loaded by PyYAML without major modifications to the parser. But fortunately, those modifications have been incorporated in ruamel.yaml (disclaimer: I am the author of that package), so "all" you need to do is subclass RangeDict to provide suitable representer and constructor (class) methods:

import io
import ruamel.yaml
from rangedict import RangeDict

class MyRangeDict(RangeDict):
    yaml_tag = u'!rangedict'

    def _walk(self, cur):
        # walk tree left -> parent -> right
        if cur.left:
            for x in self._walk(cur.left):
                yield x
        yield cur.r
        if cur.right:
            for x in self._walk(cur.right):
                yield x

    @classmethod
    def to_yaml(cls, representer, node):
        d = ruamel.yaml.comments.CommentedMap()
        for x in node._walk(node._root):
            d[ruamel.yaml.comments.CommentedKeySeq(x)] = node[x[0]]
        return representer.represent_mapping(cls.yaml_tag, d)

    @classmethod
    def from_yaml(cls, constructor, node):
        d = cls()
        for x, y in node.value:
            x = constructor.construct_object(x, deep=True)
            y = constructor.construct_object(y, deep=True)
            d[x] = y
        return d


rngdct = MyRangeDict()
rngdct[(1, 9)] = \
    {"Type": "A", "Series": "1"}
rngdct[(10, 19)] = \
    {"Type": "B", "Series": "1"}

yaml = ruamel.yaml.YAML()
yaml.register_class(MyRangeDict)  # tell the yaml instance about this class

buf = io.StringIO()

yaml.dump(rngdct, buf)
data = yaml.load(buf.getvalue())

# test for round-trip equivalence:
for x in data._walk(data._root):
    for y in range(x[0], x[1]+1):
        assert data[y]['Type'] == rngdct[y]['Type']
        assert data[y]['Series'] == rngdct[y]['Series']

The value of buf.getvalue() is exactly the readable representation shown before.
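If JSON is required after all (the fallback mentioned in the question), the same problem with sequences as keys shows up, because JSON object keys must be strings. A minimal sketch of one possible workaround, using only the standard library: a plain dict stands in for the RangeDict contents, and the `range`/`value` record layout is an assumption for illustration, not something rangedict prescribes.

```python
import json

# Plain dict standing in for the RangeDict contents
# (assumption: keys are inclusive (start, end) tuples).
ranges = {
    (1, 9): {"Type": "A", "Series": "1"},
    (10, 19): {"Type": "B", "Series": "1"},
}

# JSON object keys must be strings, so tuple keys are rejected.
try:
    json.dumps(ranges)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Store each range as a record instead of using it as a key.
records = [{"range": list(rng), "value": ranges[rng]} for rng in sorted(ranges)]
text = json.dumps(records, indent=2)

# Loading reverses the transformation; the lists become tuples again.
restored = {tuple(rec["range"]): rec["value"] for rec in json.loads(text)}
```

With a real RangeDict, the records could presumably be produced from the `_walk` helper above and re-inserted with `d[rng] = value`, in the same way `to_yaml` and `from_yaml` do.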

If you have to deal with dumping RangeDict itself (i.e. cannot subclass because you use some library that has RangeDict hardcoded), then you can add the attribute and methods of MyRangeDict directly to RangeDict by grafting/monkeypatching.
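The grafting itself can be sketched as follows. To keep the example runnable without third-party packages, `LibraryRangeDict` is a hypothetical stand-in for the hardcoded class and `Extras` plays the role of MyRangeDict; in the real case the grafted names would be `yaml_tag`, `_walk`, `to_yaml` and `from_yaml`.

```python
class LibraryRangeDict:
    # Hypothetical stand-in for a class that some library has hardcoded.
    def __init__(self):
        self._items = {}

    def __setitem__(self, rng, value):
        self._items[tuple(rng)] = value


class Extras:
    # Plays the role of MyRangeDict: holds what should be grafted on.
    yaml_tag = "!rangedict"

    def ranges(self):
        return sorted(self._items)

    @classmethod
    def from_pairs(cls, pairs):
        d = cls()
        for rng, value in pairs:
            d[rng] = value
        return d


for name in ("yaml_tag", "ranges", "from_pairs"):
    # vars() yields the raw function/classmethod objects, so they
    # bind to LibraryRangeDict rather than staying tied to Extras.
    setattr(LibraryRangeDict, name, vars(Extras)[name])

d = LibraryRangeDict.from_pairs([((10, 19), "B"), ((1, 9), "A")])
print(type(d).__name__, d.ranges())
```

Copying via `vars()` rather than `Extras.from_pairs` matters for the class methods: an already-bound class method would keep constructing `Extras` instances instead of instances of the patched class. After grafting, the patched class would be registered with `yaml.register_class()` just like MyRangeDict above.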

This is indeed a working answer. Your YAML library does the trick flawlessly and also yields a nicely human readable output file. The "round trip equivalence" part is a bit shady to me, but it does not fall into assert exceptions, so I guess my RangeDict has the correct format / data? — Rene Knuvers, Oct 05 '17 at 07:36
That equivalence part relies on some internals. It just tests all the ranges and makes sure that the `rngdct` that was created has the same value for the first value in that range (`x[0]`) as the `data`. This is just to make sure you don't `load()` something and get rid of the error, but end up with something entirely different from what you started out with. — Anthon, Oct 05 '17 at 07:43
