
I'm attempting to write YAML files from Python that contain code snippets in various languages, and I want these snippets to be human-legible in the resulting YAML file, using |-style literals with indented multiline strings.

Various answers such as this one suggest registering a custom representer with yaml.add_representer() that sets style='|' only for strings that actually contain newlines, but for simplicity we'll suppose we want to print all strings in the | style. This should be achievable by calling yaml.dump(..., default_style="|").
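For reference, the representer approach could look roughly like the sketch below. It is not taken from the linked answer: the names `LiteralDumper` and `represent_str` are made up for illustration, and `sort_keys` assumes PyYAML 5.1 or later.

```python
import yaml


class LiteralDumper(yaml.SafeDumper):
    """Hypothetical dumper subclass, so the global SafeDumper stays untouched."""


def represent_str(dumper, value):
    # Request the literal block style only for strings that contain a newline;
    # everything else keeps PyYAML's default style.
    style = "|" if "\n" in value else None
    return dumper.represent_scalar("tag:yaml.org,2002:str", value, style=style)


LiteralDumper.add_representer(str, represent_str)

snippet = {"name": "hello", "code": "def f():\n    return 1\n"}
dumped = yaml.dump(snippet, Dumper=LiteralDumper, sort_keys=False)
print(dumped)
```

The style passed to `represent_scalar` is only a request: the emitter can still fall back to a quoted string, which is exactly the behavior this question is about.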

For some strings, this works fine.

However, for the string "foobar\n " (an 8-char string ending in a newline and a space), it refuses to write the literal in multiline form, and instead quotes the string with escaped newlines.

At first I suspected that | in the YAML spec somehow doesn't support lines that contain only spaces, but this is not the case. The following code shows that my desired, hand-written YAML translation is valid and loads correctly with yaml.safe_load():

data = "foobar\n "  # Data ends in a newline and a space
desired_yaml = """\
|
  foobar
   """ # Note final YAML line has 3 spaces - 2 for indent, 1 for data

# Demonstrate that `desired_yaml` is equivalent to `data`:
import yaml
assert yaml.safe_load(desired_yaml) == data

If I try to generate this YAML programmatically, though, PyYAML insists on quoting/escaping the string:

dumped = yaml.safe_dump(data, default_style="|")
assert dumped != desired_yaml 
print(dumped)

The printed output shows that dumped is now a quoted string containing escaped newlines:

"foobar\n "

I'd like the YAML output to instead look like this -- a three-line string ending in three spaces:

|
  foobar
   

How can I achieve this with PyYAML? If this is not possible, are there third-party alternatives that would work?

goodside
  • I've also found [this SO comment](https://stackoverflow.com/questions/50519454/python-yaml-dump-using-block-style-without-quotes?rq=1#comment88081725_50519774) that mentions encountering this issue without a solution. – goodside Apr 12 '22 at 03:30
  • You can try using [ruamel](https://yaml.readthedocs.io/en/latest/) but generally, YAML is the wrong data format if your requirement is to have complete control over the serialization. – flyx Apr 12 '22 at 14:14

2 Answers


Ultimately, this appears to be a bug in PyYAML that affects any multiline string with lines ending in whitespace, tracked on GitHub as issue 441. The issue was first reported in 2020 and hasn't moved since then, as of April 2022.
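To see which strings trigger the fallback, a small check like the one below may help. It is only a sketch: `has_space_before_line_end` is a hypothetical helper, and the three sample strings are illustrative rather than an exhaustive description of the bug.

```python
import yaml


def has_space_before_line_end(text):
    """Hypothetical helper: True if any line of `text` ends in a space."""
    return any(line.endswith(" ") for line in text.split("\n"))


samples = ["foo\nbar\n", "foobar\n ", "foo \nbar\n"]
results = {}
for s in samples:
    dumped = yaml.safe_dump(s, default_style="|")
    # A dump that starts with "|" kept the literal block style;
    # anything else means PyYAML fell back to a quoted scalar.
    results[s] = (has_space_before_line_end(s), dumped.startswith("|"))
    print(repr(s), results[s])
```

For these samples, the strings with a line ending in a space are the ones that lose the block style.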

As suggested in the comments above, I solved my immediate issue by using ruamel.yaml, a drop-in replacement for PyYAML. Adapting the above code to import ruamel.yaml as yaml seems to result in default_style="|" being applied correctly.

The commenter also notes that YAML is the "wrong data format if your requirement is to have complete control over the serialization," and judging by my experience with PyYAML that sounds right. However, there are very few structured file formats capable of legibly displaying blocks of arbitrary source code (in an arbitrary programming language, with arbitrary unescaped characters). As far as I can see, only YAML and NestedText support this. YAML is working for me for now, but in hindsight NestedText may have been an easier choice.

goodside
  • see also some backstory from `ruamel.yaml` about changing the API from `PyYAML`! https://yaml.readthedocs.io/en/latest/api.html#reason-for-api-change – ti7 Apr 12 '22 at 15:48
  • You should look into using the new API in ruamel.yaml (`yaml = ruamel.yaml.YAML(typ='safe', pure=True)` and then `yaml.load()`). That should not affect processing and doesn't use the deprecated PyYAML API. – Anthon Apr 20 '22 at 16:18

I don't have any experience with ruamel.yaml, but it looks high-quality!
However, if you're in an environment without it (no network, security approval only for PyYAML's yaml.safe_load(), etc.), you have some other options.

base64-encoding

Often when you have serialization woes, a practical solution is to base64-encode the data, though this may not be an option if you intend for it to be human-readable and modifiable.

You can also include both serializations to get a visualization of the content (though I wouldn't recommend it), and you'll probably do better adding a title that sensibly describes each block.

>>> import yaml  # pyyaml
>>> import base64
>>> base64.b64encode("foobar\n ".encode()).decode()
'Zm9vYmFyCiA='
>>> y = yaml.safe_load("""
... snippets:
...   snippetA:
...     title: a foobar with a newline and trailing space for baz
...     content: Zm9vYmFyCiA=
... """)
>>> base64.b64decode(y["snippets"]["snippetA"]["content"]).decode()
'foobar\n '

Reference another file

Sometimes you can simply put the name of another file into your YAML document and have your program load the referenced file however you see fit (e.g. with open(path, "rb") as fh:).

snippets:
  snippet A XML: path/snippetA.xml
  snippet B YML: path/snippetB.yml
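A rough sketch of the loading side, assuming the manifest maps snippet names to paths relative to some base directory. `load_snippets` and the file name `snippetA.txt` are hypothetical and only here to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import yaml


def load_snippets(manifest_text, base_dir):
    """Hypothetical helper: read every file referenced under `snippets`."""
    manifest = yaml.safe_load(manifest_text)
    loaded = {}
    for name, rel_path in manifest["snippets"].items():
        # Read bytes so newlines and trailing spaces are preserved exactly.
        loaded[name] = (Path(base_dir) / rel_path).read_bytes().decode("utf-8")
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in snippet file containing the problematic string.
    (Path(tmp) / "snippetA.txt").write_bytes("foobar\n ".encode("utf-8"))
    manifest_text = "snippets:\n  snippet A: snippetA.txt\n"
    snippets = load_snippets(manifest_text, tmp)

print(repr(snippets["snippet A"]))
```

Because the snippet never passes through the YAML emitter, trailing whitespace is no longer an issue, at the cost of spreading the data over several files.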
Anthon
ti7