I have to process a lot of text that contains a number of YAML blocks, as follows:

key1:
- value1
- value2
key2:
- value1
- value2
- value3

The number of values per key can vary. I want to extract the different key-value pairs, because I have to check whether they are formatted in a certain way. My idea was to use the following regex (which I also checked with regexr):

(.*):\n(-\ .*\n)*

and to apply it with re.findall() and the re.VERBOSE flag. However, this results in

[('key1', '- value2\n'), ('key2', '- value3\n')]

not, as I would expect

[('key1', '- value1\n', '- value2\n'), ('key2', '- value1\n', '- value2\n', '- value3\n')]

What's confusing me even more is that if I use

(.*):\n(-\ .*\n)(-\ .*\n)

or

(.*):\n(-\ .*\n)(-\ .*\n)(-\ .*\n)

so explicitly writing out the value term two or three times, it works fine. This is of course not what I want; I want to catch a variable number of values per key.

I'm using Python 3.8 on Windows.

Maaike
  • Why are you doing this with *regex*? Use a YAML parser. – jonrsharpe Mar 09 '20 at 07:53
  • Your solution is here: https://stackoverflow.com/questions/50846431 , please use a parser as @jonrsharpe has mentioned, it's much simpler. – Shanyl Ong Mar 09 '20 at 07:54
  • @jonrsharpe The YAML blocks are only a small part of the total files, and are not always correctly formatted. Basically, I have to check a lot of manually created files for structure (e.g. correct MD headers in correct order) and formatting (e.g. some paragraphs should be bolded depending on surrounding elements). I've been using increasingly complex regexes to filter out the most basic to the most obscure errors, so that is why my first approach was to use regex here as well. Regardless, thanks for the answer, I'll check out the parser. – Maaike Mar 09 '20 at 08:11
  • @shanylong Thanks for the link, I'll check it out. – Maaike Mar 09 '20 at 08:12

1 Answer

Your regex defines two capture groups, so each match contains the values of those two groups. When a capture group is repeated (via * in your case), it retains only the last repetition it matched. If you want all repetitions, wrap the repeated part in an outer capture group and make the inner group non-capturing:

(.*):\n((?:- .*\n)*)

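The second group then contains all the - value* lines as one string, so you need to split it into lines yourself (splitlines() avoids the empty trailing element that split('\n') would leave behind):

result = {k: v.splitlines() for k, v in re.findall(r'(.*):\n((?:- .*\n)*)', text)}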
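
To see the difference side by side, here is a minimal sketch run against the sample from the question. It assumes the second key is named key2 and that the block is held in a string called text; the names last_only and pairs are only illustrative. The patterns are used without re.VERBOSE, since that flag would ignore the unescaped space after the dash.

```python
import re

# Sample block from the question (second key assumed to be "key2").
text = (
    "key1:\n"
    "- value1\n"
    "- value2\n"
    "key2:\n"
    "- value1\n"
    "- value2\n"
    "- value3\n"
)

# A repeated capture group keeps only its last repetition.
last_only = re.findall(r"(.*):\n(- .*\n)*", text)

# An outer group around a non-capturing repetition keeps the whole run.
pairs = re.findall(r"(.*):\n((?:- .*\n)*)", text)

# Split each captured run into its individual "- value" lines.
result = {key: values.splitlines() for key, values in pairs}

print(last_only)
print(result)
```

Note that collecting into a dict keeps only the last block for any key that occurs more than once; iterating over pairs directly avoids that if duplicate keys matter.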
a_guest
  • While the comments to my original question point me in a more fruitful direction, this answer clarifies what was going wrong in the first place. Thanks! – Maaike Mar 09 '20 at 09:01