
I would like to export a Pydantic model to YAML, but avoid repeating values by using references (anchors and aliases) instead.

Here's an example:

from typing import List
from ruamel.yaml import YAML  # type: ignore
import yaml
from pydantic import BaseModel

class Author(BaseModel):
    id: str
    name: str
    age: int

class Book(BaseModel):
    id: str
    title: str
    author: Author

class Library(BaseModel):
    authors: List[Author]
    books: List[Book]


john_smith = Author(id="auth1", name="John Smith", age=42)

books = [
    Book(id="book1", title="Some title", author=john_smith),
    Book(id="book2", title="Another one", author=john_smith),
]

library = Library(authors=[john_smith], books=books)

print(yaml.dump(library.dict()))

I get:

authors:
- age: 42
  id: auth1
  name: John Smith
books:
- author:
    age: 42
    id: auth1
    name: John Smith
  id: book1
  title: Some title
- author:
    age: 42
    id: auth1
    name: John Smith
  id: book2
  title: Another one

You can see that all author fields are repeated in each book. I would like something that uses anchors instead of copying all the information, like this:

authors:
- &auth1
  age: 42
  id: auth1
  name: John Smith
books:
- author: *auth1
  id: book1
  title: Some title
- author: *auth1
  id: book2
  title: Another one

How can I achieve this?

Anthon
user936580

1 Answer


When you traverse a nested Python data structure in order to convert it, you have to deal with the possibility of self-reference; otherwise your code will get into an endless loop if the data is self-referential.

The way ruamel.yaml (and the standard library's json.dump()) deals with that is by keeping track of the id()s of the collection objects (everything you want to recurse into, so not primitives like int, float, str). If such an id() has already been seen, ruamel.yaml represents the first occurrence of that collection object as an anchor and the other occurrences as aliases, so it doesn't have to recurse into the object again. (json.dump() instead tells you it cannot dump such a structure, but at least it doesn't hang.)

The same mechanism (keeping track of id()s) is used in ruamel.yaml to not repeat the same collection when it is referenced in multiple other collections.
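As a small illustration of the json side of this, using only the standard library (the variable names are made up for the example): a dict that is merely shared is written out twice, because JSON has no alias syntax, while a dict that contains itself is rejected instead of looping forever.

```python
import json

shared = {"id": "auth1", "name": "John Smith"}
doc = {"first": shared, "second": shared}  # same object referenced twice, no cycle

# JSON has no anchors/aliases, so the shared dict is simply repeated.
shared_json = json.dumps(doc)
print(shared_json)

cyclic = {"name": "loop"}
cyclic["self"] = cyclic  # the dict now contains itself

try:
    json.dumps(cyclic)
    circular_detected = False
except ValueError as exc:
    # json tracks the id()s of the containers it is currently inside
    circular_detected = True
    print("refused:", exc)
```

Dumping `doc` with ruamel.yaml instead would emit the shared dict once with an anchor and refer to it with an alias the second time.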

pydantic doesn't seem to do that, hence the "written out" structure you get when calling library.dict(). I think that is the reason why the documentation tells you to use a string with a class name when dumping pydantic to JSON with self-referential data.

To get around this limitation of pydantic you could do two things:

  • write an alternative to .dict() that returns a data structure that dumps to the required YAML document format, which means it needs to return a structure in which the very same dict object appears in multiple places.

  • make sure you can dump your classes directly using ruamel.yaml, so you don't have to convert them.

But for both of these to work, the author that you add to book1 and book2 must still be the same object after adding, and it is not. You cannot safely assume that two dicts with the same key/value pairs are the same object, so any comparison needs to be done using is and not using ==.
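The difference between the two comparisons can be seen with plain dicts (a minimal sketch; the variable names are only illustrative):

```python
first = {"id": "auth1", "name": "John Smith", "age": 42}
second = {"id": "auth1", "name": "John Smith", "age": 42}  # equal content, separate object
alias = first  # another name for the very same object

equal_but_distinct = (first == second) and (first is not second)
same_object = alias is first

# Changing one copy shows why equality alone is not enough:
# the two dicts were never linked, so they drift apart.
second["age"] = 43
still_equal = first == second

print(equal_but_distinct, same_object, still_equal)
```

Only objects that are identical under `is` share an id(), and only those can end up as an anchor with aliases in the YAML output.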

After you pass in john_smith to the two calls of Book(), you don't have an attribute .author that points to the same data (i.e. has the same id()):

from pydantic import BaseModel
from typing import List

class Author(BaseModel):
    id: str
    name: str
    age: int

class Book(BaseModel):
    id: str
    title: str
    author: Author

class Library(BaseModel):
    authors: List[Author]
    books: List[Book]


john_smith = Author(id="auth1", name="John Smith", age=42)

books = [
    Book(id="book1", title="Some title", author=john_smith),
    Book(id="book2", title="Another one", author=john_smith),
]

library = Library(authors=[john_smith], books=books)

print('same author?', john_smith is library.books[0].author)
print('same author?', library.books[0].author is library.books[1].author)

which gives:

same author? False
same author? False

What you can do is force the authors to be identical, and then use something smarter than pydantic's .dict():

import sys
import ruamel.yaml


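def gen_data(d, id_map=None):
    """Convert models/lists to dicts/lists, reusing the result for objects seen before."""
    if id_map is None:
        id_map = {}  # maps id() of a source object to its converted result
    d_id = id(d)
    if d_id in id_map:
        # same source object seen again: return the same converted object,
        # so that ruamel.yaml can emit an alias for it
        print('already found', id_map)
        return id_map[d_id]
    if isinstance(d, BaseModel):
        ret_val = {}
        for k, v in d:  # iterating a model yields (field name, value) pairs
            if k == 'author':
                print('auth', v, id(v))
            ret_val[k] = gen_data(v, id_map)
    elif isinstance(d, list):
        ret_val = []
        for elem in d:
            ret_val.append(gen_data(elem, id_map))
    else:
        return d  # should be primitive
    id_map[d_id] = ret_val
    return ret_val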

# force authors to be the same
library.books[0].author = library.books[1].author = library.authors[0]
assert library.books[0].author is library.books[1].author

# alternative for .dict()
data = gen_data(library)
    
yaml = ruamel.yaml.YAML()
yaml.dump(data, sys.stdout)

and that results in what you wanted:

auth id='auth1' name='John Smith' age=42 140494566559168
already found {140494566559168: {'id': 'auth1', 'name': 'John Smith', 'age': 42}, 140494576359168: [{'id': 'auth1', 'name': 'John Smith', 'age': 42}]}
auth id='auth1' name='John Smith' age=42 140494566559168
already found {140494566559168: {'id': 'auth1', 'name': 'John Smith', 'age': 42}, 140494576359168: [{'id': 'auth1', 'name': 'John Smith', 'age': 42}], 140494566559216: {'id': 'book1', 'title': 'Some title', 'author': {'id': 'auth1', 'name': 'John Smith', 'age': 42}}}
authors:
- &id001
  id: auth1
  name: John Smith
  age: 42
books:
- id: book1
  title: Some title
  author: *id001
- id: book2
  title: Another one
  author: *id001

Please note that you shouldn't import yaml, but instead instantiate a ruamel.yaml.YAML() instance.

If necessary, ruamel.yaml lets you control the name of the anchor/alias, so that it is something other than id001.

Anthon
    Thank you for your answer and sorry for taking so long to answer. I've found a way of making the authors in dict equal: disabling Pydantic's `copy_on_validation` on `BaseModel`. – user936580 May 01 '22 at 10:05