29

I'm writing a document generator that takes YAML data as input, and I want it to record which line of the YAML file each item was generated from. What is the best way to do this? So if the YAML file is like this:

- key1: item 1
  key2: item 2
- key1: another item 1
  key2: another item 2

I want something like this:

[
     {'__line__': 1, 'key1': 'item 1', 'key2': 'item 2'},
     {'__line__': 3, 'key1': 'another item 1', 'key2': 'another item 2'},
]

I'm currently using PyYAML, but any other library is OK if I can use it from Python.

puzzlet
  • For further inspiration, here's my code for this. It contains more information than requested above as it reports the location information using start_mark, end_mark on each dict/list/unicode (using dict_node, list_node, unicode_node subclasses, respectively). https://gist.github.com/dagss/5008118 – Dag Sverre Seljebotn Jan 16 '13 at 09:30

4 Answers

20

Here's an improved version of puzzlet's answer:

import yaml
from yaml.loader import SafeLoader

class SafeLineLoader(SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = super(SafeLineLoader, self).construct_mapping(node, deep=deep)
        # Add 1 so line numbering starts at 1
        mapping['__line__'] = node.start_mark.line + 1
        return mapping

You can use it like this:

data = yaml.load(whatever, Loader=SafeLineLoader)
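For reference, here is a minimal end-to-end sketch that runs the loader above on the sample document from the question. The inline `document` string and the variable names are assumptions made for illustration; only `SafeLineLoader` comes from the answer.

```python
import yaml
from yaml.loader import SafeLoader


class SafeLineLoader(SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = super().construct_mapping(node, deep=deep)
        # start_mark.line is 0-based, so add 1 for editor-style numbering
        mapping['__line__'] = node.start_mark.line + 1
        return mapping


# The sample document from the question, as an inline string (assumption)
document = (
    "- key1: item 1\n"
    "  key2: item 2\n"
    "- key1: another item 1\n"
    "  key2: another item 2\n"
)

data = yaml.load(document, Loader=SafeLineLoader)
for item in data:
    print(item['__line__'], item['key1'])
# prints:
# 1 item 1
# 3 another item 1
```

Note that `__line__` is added to every mapping, nested ones included, and would overwrite a real key of the same name if the YAML happened to contain one.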
augurar
  • You fail to mention that, on uncontrolled YAML input, this can get your disk wiped (or worse). With the OP's example input there is no need to play it unsafe and you could just subclass `SafeLoader` instead of `Loader`. I also fail to see how this would address getting the line number of the sequence in the OP's (or any other) YAML document. – Anthon Dec 06 '18 at 11:20
  • @Anthon In the current version of PyYaml, `Loader` is the same as `SafeLoader`. – augurar Dec 08 '18 at 03:30
  • @Anthon This is an enhancement of puzzlet's answer with the same behavior. It adds a key `__line__` to each mapping in the YAML structure with a value of the starting line of that mapping node. – augurar Dec 08 '18 at 03:36
  • Which version of PyYAML are you using? PyYAML 4.0 finally had that security hole fixed, but that version has been retracted half a year ago. In the latest version of PyYAML on PyPI ( [3.13](https://pypi.org/project/PyYAML/)) `Loader` uses the unsafe `Constructor` and `SafeLoader` uses `SafeConstructor`. (`loader.py` lines 38 resp. 28) – Anthon Dec 08 '18 at 07:49
  • @Anthon Hm, I was looking at the latest code on their develop branch. I'll change to `SafeLoader` just to be clear. – augurar Dec 11 '18 at 02:14
  • I get the line index starting from 1 in pyyaml 5.3.1. – marko.ristin Nov 28 '20 at 21:01
  • Does anyone know a good, robust way to modify this to only attach `__line__` to the top-most level of the yaml? Best I could come up with is `if node.start_mark.column == 2: mapping["__line__"] = node.start_mark.line + 1`, but it feels hacky to do it based on the column number, and could possibly fail if the yaml is formatted differently? – V. Rubinetti Jan 04 '21 at 16:44
  • @V.Rubinetti Since the constructor performs a depth-first traversal, you could add an attribute to track current depth and override `construct_object()` to increment/decrement it appropriately. You'd need some extra logic to handle anchors correctly, if needed for your use case. – augurar Jan 31 '21 at 20:39
14

I got this working by adding hooks to Composer.compose_node and Constructor.construct_mapping:

import yaml
from yaml.composer import Composer
from yaml.constructor import Constructor

def main():
    # read the file inside a with block so the handle is closed promptly
    with open('data.yml') as f:
        loader = yaml.Loader(f.read())
    def compose_node(parent, index):
        # the line number where the previous token has ended (plus empty lines)
        line = loader.line
        node = Composer.compose_node(loader, parent, index)
        node.__line__ = line + 1
        return node
    def construct_mapping(node, deep=False):
        mapping = Constructor.construct_mapping(loader, node, deep=deep)
        mapping['__line__'] = node.__line__
        return mapping
    loader.compose_node = compose_node
    loader.construct_mapping = construct_mapping
    data = loader.get_single_data()
    print(data)


if __name__ == '__main__':
    main()
puzzlet
7

If you are using ruamel.yaml >= 0.9 (of which I am the author) with the RoundTripLoader, you can access the lc property on collection items to get the line and column where they start in the source YAML:

def test_item_04(self):
    data = load("""
     # testing line and column based on SO
     # http://stackoverflow.com/questions/13319067/
     - key1: item 1
       key2: item 2
     - key3: another item 1
       key4: another item 2
        """)
    assert data[0].lc.line == 2
    assert data[0].lc.col == 2
    assert data[1].lc.line == 4
    assert data[1].lc.col == 2

(line and column start counting at 0).

This answer shows how to add the lc attribute to string types during loading.

Anthon
  • Couldn't find a way to make this work if the list is inside an ordered map: with `key1: !!omap\n - key4: item2\n - key3: item3` it's not possible to access the line numbers of `key4` and `key3`. – zezollo Aug 16 '17 at 11:42
  • @zezollo an orderedmap doesn't by default get loaded into a CommentedMap structure and doesn't therefore have the `lc` attribute. You would have to register the !omap loading as subclass of CommentedMap. That is doable, but more than I can answer in a comment. You should post a new question if you cannot figure out how to do that. – Anthon Aug 16 '17 at 11:59
  • Indeed I cannot figure this out. I've only found a "dirty" workaround to get the lines numbers. Question asked [here](https://stackoverflow.com/questions/45716281/parsing-yaml-get-line-numbers-even-in-ordered-maps). – zezollo Aug 16 '17 at 14:27
6

The following code builds on the previous answers. If you also need the line numbers of leaf attributes, it may help:

from yaml.composer import Composer
from yaml.constructor import Constructor
from yaml.nodes import ScalarNode
from yaml.resolver import BaseResolver
from yaml.loader import Loader


class LineLoader(Loader):
    def __init__(self, stream):
        super(LineLoader, self).__init__(stream)

    def compose_node(self, parent, index):
        # the line number where the previous token has ended (plus empty lines)
        line = self.line
        node = Composer.compose_node(self, parent, index)
        node.__line__ = line + 1
        return node

    def construct_mapping(self, node, deep=False):
        node_pair_lst = node.value
        node_pair_lst_for_appending = []

        for key_node, value_node in node_pair_lst:
            shadow_key_node = ScalarNode(tag=BaseResolver.DEFAULT_SCALAR_TAG, value='__line__' + key_node.value)
            shadow_value_node = ScalarNode(tag=BaseResolver.DEFAULT_SCALAR_TAG, value=key_node.__line__)
            node_pair_lst_for_appending.append((shadow_key_node, shadow_value_node))

        node.value = node_pair_lst + node_pair_lst_for_appending
        mapping = Constructor.construct_mapping(self, node, deep=deep)
        return mapping


if __name__ == '__main__':
    stream = """             # The first line
    key1:                    # This is the second line
      key1_1: item1
      key1_2: item1_2
      key1_3:
        - item1_3_1
        - item1_3_2
    key2: item 2
    key3: another item 1
    """
    loader = LineLoader(stream)
    data = loader.get_single_data()

    from pprint import pprint

    pprint(data)

The output is shown below. For each key, an extra key with the prefix "__line__" (for example "__line__key1") is added at the same level, holding that key's line number.

PS: This does not yet record line numbers for list items.

{'__line__key1': 2,
 '__line__key2': 8,
 '__line__key3': 9,
 'key1': {'__line__key1_1': 3,
          '__line__key1_2': 4,
          '__line__key1_3': 5,
          'key1_1': 'item1',
          'key1_2': 'item1_2',
          'key1_3': ['item1_3_1', 'item1_3_2']},
 'key2': 'item 2',
 'key3': 'another item 1'}
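
For the list items mentioned in the PS above, one possible approach (a sketch, not part of the code in this answer) is to load sequences into a list subclass that remembers each item's starting line. `LineList`, `item_lines` and `ListLineLoader` are hypothetical names; `add_constructor` and `construct_sequence` are existing PyYAML methods.

```python
import yaml
from yaml.loader import SafeLoader


class LineList(list):
    """Hypothetical list subclass that remembers where each item started."""
    item_lines = ()  # 1-based source line of each item, set by the loader


class ListLineLoader(SafeLoader):
    def construct_line_seq(self, node):
        # Same two-step shape as PyYAML's own sequence constructor:
        # yield the (still empty) container first, then fill it in.
        data = LineList()
        yield data
        data.extend(self.construct_sequence(node))
        data.item_lines = [child.start_mark.line + 1 for child in node.value]


# Replace the constructor for plain YAML sequences on this loader class only
ListLineLoader.add_constructor(
    'tag:yaml.org,2002:seq', ListLineLoader.construct_line_seq)

document = (
    "key1_3:\n"
    "  - item1_3_1\n"
    "  - item1_3_2\n"
)
data = yaml.load(document, Loader=ListLineLoader)
items = data['key1_3']
print(list(zip(items.item_lines, items)))
# prints: [(2, 'item1_3_1'), (3, 'item1_3_2')]
```

Keeping the line numbers on a separate attribute leaves the list contents untouched, so the result still compares equal to a plain list.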
Menglong Li