I have to process a lot of text that contains a number of YAML blocks, as follows:
key1:
- value1
- value2
key2:
- value1
- value2
- value3
The number of values per key can vary. I want to extract the different key-value pairs, because I have to check whether they are formatted in a certain way. My idea was to use the following regex (which I also checked with regexr):
(.*):\n(-\ .*\n)*
together with re.findall() and the re.VERBOSE flag. However, this results in
[('key1', '- value2\n'), ('key2', '- value3\n')]
and not, as I would expect,
[('key1', '- value1\n', '- value2\n'), ('key2', '- value1\n', '- value2\n', '- value3\n')]
What's confusing me even more is that if I use
(.*):\n(-\ .*\n)(-\ .*\n)
or
(.*):\n(-\ .*\n)(-\ .*\n)(-\ .*\n)
so explicitly writing out the value term two or three times, it works fine. This is of course not what I want; I want to catch a variable number of values per key.
I'm using Python 3.8 on Windows.
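For reference, here is a minimal script that should reproduce both behaviours. The sample text and the variable names `text`, `repeated`, and `explicit` are only for illustration, and I am assuming the text ends with a trailing newline:

```python
import re

# Illustrative sample input; assumed to end with a trailing newline
text = (
    "key1:\n"
    "- value1\n"
    "- value2\n"
    "key2:\n"
    "- value1\n"
    "- value2\n"
    "- value3\n"
)

# Pattern with a repeated value group: only the last value per key is returned
repeated = re.findall(r"(.*):\n(-\ .*\n)*", text, re.VERBOSE)
print(repeated)
# [('key1', '- value2\n'), ('key2', '- value3\n')]

# Pattern with the value group written out twice: both values are returned,
# but key2's third value is not captured
explicit = re.findall(r"(.*):\n(-\ .*\n)(-\ .*\n)", text, re.VERBOSE)
print(explicit)
# [('key1', '- value1\n', '- value2\n'), ('key2', '- value1\n', '- value2\n')]
```

So the repeated group does seem to match every value line, but only the last one shows up in the result.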