
Given the following pattern:

group1: hello, group2: world
group1: hello (hello, world) world, group2: world
group1: hello world

of the style <group_name>: <group_value>[, <group_name>: <group_value>[...]].

In general I use the following regex to extract the values:

group1:\s(?P<group1>[^,\n]+)(?:,\sgroup2:\s(?P<group2>[^,\n]+))?\n

which works fine unless a , appears inside the group_value.
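To make the failure concrete, here is a minimal Python sketch. It assumes the standard re module and the three sample lines from above; the variable names are only for illustration, and the optional group is written as a non-capturing (?:...) group. The second sample line is skipped entirely, because [^,\n]+ stops at the comma inside the brackets and nothing after it fits the rest of the pattern.

```python
import re

# The comma-limited pattern from the question.
NAIVE = re.compile(
    r"group1:\s(?P<group1>[^,\n]+)(?:,\sgroup2:\s(?P<group2>[^,\n]+))?\n"
)

SAMPLE = (
    "group1: hello, group2: world\n"
    "group1: hello (hello, world) world, group2: world\n"
    "group1: hello world\n"
)

# Collect (group1, group2) for every line the pattern manages to match.
naive_results = [(m["group1"], m["group2"]) for m in NAIVE.finditer(SAMPLE)]

for pair in naive_results:
    print(pair)
```

Only the first and third lines produce a match; the line with the bracketed comma is silently dropped.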

I know that this toy example can be solved by something like:

group1:\s(?P<group1>.+?)(?:,\sgroup2:\s(?P<group2>.+?))?\n

However, I do want to protect myself against accidentally matching everything, so I would still like the match to stop when it encounters a ,.
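As an illustration of that concern, the following Python sketch (standard re module assumed) runs the lazy pattern against a line containing an unexpected field. The field name group3 is hypothetical and only serves to show the over-match: because .+? may cross commas, the whole remainder of the line ends up in group1.

```python
import re

# The lazy pattern from the question: .+? is allowed to cross commas.
LAZY = re.compile(
    r"group1:\s(?P<group1>.+?)(?:,\sgroup2:\s(?P<group2>.+?))?\n"
)

# Works as intended on the line with a bracketed comma.
ok = LAZY.search("group1: hello (hello, world) world, group2: world\n")
ok_pair = (ok["group1"], ok["group2"])

# Hypothetical input with an unexpected field name (group3):
# the optional group2 part cannot match, so group1 swallows everything.
bad = LAZY.search("group1: a, group3: b\n")
bad_pair = (bad["group1"], bad["group2"])

print(ok_pair)
print(bad_pair)
```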

Question: Is there a (general) way to match up to , and for that purpose ignore ,s that are in brackets?

  • Maybe `group1:\s(?P<group1>.*?)(?::?,\sgroup2:\s+(?P<group2>.*))?$`? Actually, it is a variation of your second solution, what is wrong with that one? – Wiktor Stribiżew Aug 18 '23 at 08:09
  • Could you please post the expected output in your question? Thank you. – RavinderSingh13 Aug 18 '23 at 08:09
  • The output I expect is the same as in my toy example. However, I don't want to match too much by accident with `.+?`. I would like to at least stop on `,` when I encounter one, unless it is inside balanced brackets. – Lukas Unterschuetz Aug 18 '23 at 08:17
  • Maybe something like `group1:(?P<group1>(?>[^,\n()]*(?:(\((?:[^()]++|(?1))*\)))?)*)(?:, (?P<group2>.*))?\n` https://regex101.com/r/rzH1NG/1 – The fourth bird Aug 18 '23 at 08:18

1 Answer

Using PCRE, you could make use of a recursive pattern for balanced parentheses with possessive quantifiers.

You define the pattern for group 1, and if the same logic applies to group 2, you can recurse the subpattern defined in group 1.

As you exclude matching a newline in the negated character class, you might use \h to match horizontal whitespace characters instead of \s, which can also match a newline.

\bgroup1:\h+(?P<group1>(?:[^,\n()]*(?:(\((?:[^()\n]+|(?2))*+\)))?)*+)(?:,\h+group2:\h+(?P<group2>\g<group1>))?\R
  • \bgroup1:\h+ Match the word group1 and then : and 1+ horizontal whitespace chars
  • (?P<group1> Named group1
    • (?: Non capture group
      • [^,\n()]* Match optional chars other than , newline ( or )
      • (?: Non capture group
        • (\((?:[^()\n]+|(?2))*+\)) Capture group 2: match balanced parentheses, recursing into this same numbered group with (?2) (this is not the named group2)
      • )? Close group and make it optional
    • )*+ Close the group and optionally repeat with a possessive quantifier (no backtracking)
  • ) Close group1
  • (?: Non capture group
    • ,\h+group2:\h+ Match a comma, 1+ horizontal whitespace chars, group2: and again 1+ horizontal whitespace chars
    • (?P<group2>\g<group1>) Named group2, recurse the subpattern in named group1
  • )? Close the non capture group and make it optional
  • \R Match a newline
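
Python's built-in re module has no recursion and no \h, so the pattern above will not run there unchanged. As a rough sketch only, and not the recursive PCRE pattern itself, the same idea can be approximated by allowing exactly one level of parentheses. The names VALUE and PATTERN are illustrative, [ \t] stands in for \h, and $ with re.MULTILINE stands in for \R.

```python
import re

# One value: any run of chars that are not a comma, newline, or parenthesis,
# or a complete "( ... )" chunk. Only ONE nesting level is handled here;
# the recursive (?2) in the PCRE pattern handles arbitrary depth.
VALUE = r"(?:[^,\n()]|\([^()\n]*\))+"

PATTERN = re.compile(
    rf"\bgroup1:[ \t]+(?P<group1>{VALUE})"
    rf"(?:,[ \t]+group2:[ \t]+(?P<group2>{VALUE}))?$",
    re.MULTILINE,
)

SAMPLE = (
    "group1: hello, group2: world\n"
    "group1: hello (hello, world) world, group2: world\n"
    "group1: hello world\n"
)

results = [(m["group1"], m["group2"]) for m in PATTERN.finditer(SAMPLE)]

for pair in results:
    print(pair)
```

A line with an unclosed parenthesis, such as `group1: a (b, c`, does not match at all with this sketch, which is the conservative behaviour: it fails rather than matching too much. For nested parentheses, a recursion-capable engine is still needed.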


The fourth bird