
I want to match nested wiki functions (wiki parser functions), which start with a function name followed by a colon. However, I cannot even get a recursive PCRE pattern to work for a first-level test. A match must start with {{aFunctionName: that is, a function name followed by a colon, in regex {{[\w\d]+: and the test text can look like this:

1 {{DEFAULTSORT: shall be matched {{PAGENAME}} }}
2 {{DEFAULTSORT: shall be matched }}
3 {{DEFAULTSORT: shall be matched {{PAGENAMEE: some text}} }}
4 Lorem ipsum {{VARIABLE shall not be matched}}
5 {{Some template|param={{VARIABLE}} shall not be matched }}

I'm able to

  • get any nested curly braces using {{(?:(?:(?!{{|}}).)++|(?R))*}}
    which matches lines 1, 2, 3 and, partially, 4 and 5
  • get any nested wiki function using ({{(?:[\w\d]+:)(?:(?:(?!{{|}}).)++|(?1))*}})
    which only matches line 3, but I also want to match lines 1 and 2.

But I have no idea how to construct a regex pattern that tests something like (written as pseudo code):

{{match1st-level-Function: then anything {{nested}} or not nested }}
{{do not match simple {{nested}} things}}

Any help from a pcre regex expert? Thank you!

1 Answer


Use something like this:

{{\w+:([^{}]*+(?:{{(?1)}}[^{}]*)*+)}}

A recursive pattern does not have to use (?R). You can also refer to any capture group opened earlier by its number, by its position relative to the current position, or by its name (when you use named captures).

Other possible syntaxes are:

{{\w+:([^{}]*+(?:{{(?-1)}}[^{}]*)*+)}}
#                    ^------ relative reference: the last group on the left

{{\w+:([^{}]*+(?:{{\g<1>}}[^{}]*)*+)}}
#                  ^----- oniguruma syntax

{{\w+:([^{}]*+(?:{{\g<-1>}}[^{}]*)*+)}}
#                  ^----- relative with oniguruma syntax

{{\w+:(?<name>[^{}]*+(?:{{\g<name>}}[^{}]*)*+)}}
#                         ^---- named capture (oniguruma)

{{\w+:(?<name>[^{}]*+(?:{{(?&name)}}[^{}]*)*+)}}
#                         ^---- named capture (perl syntax)

All these syntaxes can be used with pcre.
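
To see what the group reference does, the recursion can be spelled out by hand. The following is only a sketch in Python, whose built-in `re` module has no regex recursion, so `match_body` plays the role of group 1 and calls itself where the pattern has `(?1)`. The names `match_body`, `find_functions` and `TEST_LINES` are made up for this illustration; the test lines are the five lines from the question.

```python
import re

NAME = re.compile(r"\{\{\w+:")   # the non-recursive prefix {{\w+:
PLAIN = re.compile(r"[^{}]*")    # a run of non-brace characters, like [^{}]*+


def match_body(text, pos):
    """Hand-written mirror of group 1: [^{}]*+(?:{{(?1)}}[^{}]*)*+
    Returns the index where the body ends (it can match an empty string)."""
    pos = PLAIN.match(text, pos).end()
    while text.startswith("{{", pos):
        inner_end = match_body(text, pos + 2)   # corresponds to the (?1) call
        if not text.startswith("}}", inner_end):
            break                               # this repetition fails, keep earlier ones
        pos = PLAIN.match(text, inner_end + 2).end()
    return pos


def find_functions(text):
    """Return every {{name: ...}} span found while scanning left to right."""
    found, pos = [], 0
    while True:
        m = NAME.search(text, pos)
        if m is None:
            return found
        end = match_body(text, m.end())
        if text.startswith("}}", end):
            found.append(text[m.start():end + 2])
            pos = end + 2
        else:
            pos = m.start() + 1                 # retry one character further right


TEST_LINES = [
    "{{DEFAULTSORT: shall be matched {{PAGENAME}} }}",
    "{{DEFAULTSORT: shall be matched }}",
    "{{DEFAULTSORT: shall be matched {{PAGENAMEE: some text}} }}",
    "Lorem ipsum {{VARIABLE shall not be matched}}",
    "{{Some template|param={{VARIABLE}} shall not be matched }}",
]

for line in TEST_LINES:
    print(find_functions(line))
```

Only the outermost braces need the `\w+:` prefix here, because the prefix sits outside the part that recurses. That is why lines 1 and 2 are accepted even though `{{PAGENAME}}` has no colon.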

If you absolutely want to use the whole pattern for your recursion, you can use a conditional to test whether you are in a nested part or not:

{{(?(R)|\w+:)[^{}]*+(?:(?R)[^{}]*)*+}}

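The conditional is (?(R)|\w+:) and follows this schema: (?(condition) True | False). The condition R is true only inside a recursive call, where the empty True branch applies; at the top level the False branch \w+: is required.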
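
The whole-pattern variant can be mimicked with a single recursive function that carries a flag. This is a hand-written Python sketch, not PCRE: the hypothetical `nested` parameter stands in for the `(R)` condition, and the function name `match_braces` is invented for this example.

```python
import re

PREFIX = re.compile(r"\w+:")     # the False branch of the conditional
PLAIN = re.compile(r"[^{}]*")    # a run of non-brace characters


def match_braces(text, pos, nested=False):
    """Mirror of {{(?(R)|\\w+:)[^{}]*+(?:(?R)[^{}]*)*+}} anchored at pos.
    Returns the end index of the match, or None if there is no match."""
    if not text.startswith("{{", pos):
        return None
    pos += 2
    if not nested:                       # (?(R)|\w+:) : prefix only at top level
        m = PREFIX.match(text, pos)
        if m is None:
            return None
        pos = m.end()
    pos = PLAIN.match(text, pos).end()
    while True:
        inner_end = match_braces(text, pos, nested=True)   # the (?R) call
        if inner_end is None:
            break
        pos = PLAIN.match(text, inner_end).end()
    return pos + 2 if text.startswith("}}", pos) else None


line = "{{DEFAULTSORT: shall be matched {{PAGENAME}} }}"
print(match_braces(line, 0) == len(line))
print(match_braces("{{VARIABLE shall not be matched}}", 0))
```

The same function accepts `{{VARIABLE ...}}` when called with `nested=True`, which is exactly the case the conditional allows inside a recursion.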

Casimir et Hippolyte
  • Thanks. Testing performance just for `{{…}}` on https://regex101.com/, I see that `{{(?:(?:(?!{{|}}).)*+|(?R))*}}` takes much longer and needs many more steps than `{{([^{}]*+(?:{{(?1)}}[^{}]*)*+)}}` (a ratio of almost 100:1). Could you explain why? At which step in the pattern does the engine have to try more possibilities than in the other pattern? – andreas.naturwiki May 20 '16 at 07:59
  • @andreas.naturwiki: `(?:(?!not_that).)*` is slow because the lookahead (and the subpattern inside it) must be tested for each character. Writing `(?:this+|that+)*+` is already much faster (greedy quantifiers, a possessive quantifier to prevent backtracking, only the alternation needs to be tested). But a better way is to "unroll" the pattern to avoid this alternation test: `this*+(?:that+this*)*+` – Casimir et Hippolyte May 20 '16 at 09:57
  • @andreas.naturwiki: Note also that this pattern imposes no constraint, because it can match an empty string, and everything it matches becomes atomic because of the possessive quantifiers. This construct produces almost no backtracking and needs few steps, in particular when "this" is a character class. – Casimir et Hippolyte May 20 '16 at 10:05
  • @andreas.naturwiki: Before discussing performance, note that the two regexes in your comment are not equivalent. Your first regex matches lone `{` and `}`, while the second one does not. – nhahtdh May 25 '16 at 09:46
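
The difference pointed out in the last comment can be checked on simplified, non-recursive versions of the two patterns. This is a sketch under assumptions: the recursive branch is dropped because Python's `re` module has no recursion, possessive quantifiers are left out so that it also runs on Python versions whose `re` lacks them, and `UNROLLED_LONE` is an extra variant written for this example (not taken from the thread) that unrolls the loop while still allowing lone braces on single-line input.

```python
import re

# Lookahead tested before every single character (the slow shape).
TEMPERED = re.compile(r"\{\{(?:(?!\{\{|\}\}).)*\}\}")

# Unrolled shape with a plain character class: rejects any lone brace.
UNROLLED = re.compile(r"\{\{[^{}]*\}\}")

# Unrolled shape that still accepts a lone "{" or "}" between the delimiters.
UNROLLED_LONE = re.compile(r"\{\{[^{}]*(?:(?:\{(?!\{)|\}(?!\}))[^{}]*)*\}\}")


def compare(sample):
    """Return what each pattern matches in sample (None when it finds nothing)."""
    results = []
    for pattern in (TEMPERED, UNROLLED, UNROLLED_LONE):
        m = pattern.search(sample)
        results.append(m.group() if m else None)
    return tuple(results)


print(compare("{{PAGENAME}}"))   # no lone braces inside
print(compare("{{a{b}}"))        # one lone "{" inside
```

On input without lone braces all three agree; on `{{a{b}}` only the plain character-class version finds nothing, which is the non-equivalence described above.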