Parsing balanced nested wiki templates and extracting a single-line parameter's content with a regexp

Question

I know that parsing nested strings or HTML is better done by a real parser, but in my case I have simple templates and wanted to extract the content of the Wiki parameter 'title' from a template. It took me a while to achieve this, but thanks to the regex tool of Lars Olav Torvik (http://regex.larsolavtorvik.com/) and this user forum I got there. Maybe someone finds it useful. (We all want to contribute, hey, don't we? ;-) The following code, annotated with comments, does the trick. I had to use lookaround assertions so that two templates do not get mixed together when one of them has no title.

I'm not yet sure about the two questions in the regex comments (see (?# Question: …)). First, did I understand the recursive part at (?R) correctly? Does it take the pattern to recurse into from the outermost defined level, i.e. from the second regexp line \{\{ to the last regexp line \}\}? Second, what is the difference between ++ and + before the alternative with (?R)? Both seem to work equally when tested.

The original wiki templates on a page (simplest case):

$wikiTemplate = "
{{Templ1
| title = (1. template) title
}}

{{Templ2
| any parameter = something {{template}}
}}

{{Templ1
| title = (3. template) title
}}
";

The replacement:

$wikiTemplate = preg_replace(
  array(
  // tag all templates with START … END and add a TITLE-placeholder before
  // and take care of balanced {{ …  }} recursiveness 
    "@(?s)   (?# switch to dotall match, i.e. also linebreaks )
      \{\{ (?# find two {{ )
      (?: (?# group 1 as a non-backreferenced match  )
        (?:  (?# group 2 as a non-backreferenced match  )
          (?! (?# in group 1 anything but not {{ or }} )
            \{\{ 
            |   (?# or )
            \}\}
          )
          .
        )++  (?# Question: what is the difference between ++ and + here? )
        |    (?# or )
        (?R) (?# is it recursive of what is defined in the outermost,
              i.e. 2nd regexp line with \{\{ and last line with \}\}
              Question: is that here understood correctly? ) 
      )
      * (?# zero or many times of the inner regexp definitions )
      \}\} (?# find two }} )
    @x",// x-extended → ignore white space in the pattern
  // replace TITLE by single line content of title parameter 
    "@
      (?<=TITLE) (?# TITLE must precede the following linebreak but is not
                  backreferenced within \\0, i.e. the whole returned match)
      ([\n\r]+)  (?#linebr in 1 may also be described as . because of
                  s-modifier dotall)
      (?:        (?# start non-backreferenced match )
        .        (?# any character but not followed by START)
        (?!START)
      )+      (?# multiple times)
      (?:     (?# start non-backreferenced match )
        \|\s*title\s*=\s* (?#find the parameter '| title = ')
      )
      ([^\r\n]+)  (?#get title now to \\2 but exclude the line break. 
                   Note it is buggy when there is no line break )
      (?:     (?# start non-backreferenced match )
        .     (?# any character but not followed by END)
        (?!END)
      )
      +       (?# multiple times)
      .       (?# any single character, e.g. the last  because as all
               stuff before captures anything not followed by END)
      (?:END) (?#a not backreferenced END)
    @msx", // m-multiline, s-dotall match also linebreaks,
           // x-extended → ignore white space in the pattern
  ), 
  array(
    "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
  # replace the TITLE to  TITLEtitle contentTITLE…
    "\\2TITLE\\0",
  ),
  $wikiTemplate
);
print_r($wikiTemplate);

The output then has the title tagged by TITLE above each template, but only if the template has a title parameter:

TITLE(1. template) titleTITLE
START{{Templ1
 | title = (1. template) title
}}END

TITLE
START{{Templ2
 | any parameter = something {{template}}
}}END

TITLE(3. template) titleTITLE
START{{Templ1
 | title = (3. template) title
}}END

Any insight into my questions about understanding the regexp, or any improvements? Thanks, Andreas.

score 0 · Answer 1 · edited Apr 13 '17 at 12:40

++ is a possessive quantifier. Appending a + to any repetition quantifier (+, *, ?, {...}) makes it possessive. That means the regex engine will not backtrack into the repetition to try fewer repetitions once it has left it for the first time. So a possessive quantifier basically turns the repetition into an atomic group. Sometimes this is only an optimization and sometimes it actually changes what matches. You can do some very good reading here.
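To make the difference visible, here is a minimal sketch. The test string 'aaab' and the variable names are my own and not taken from the question; it only illustrates that a greedy + gives characters back while a possessive ++ does not.

```php
<?php
// Greedy +: "a+" first grabs all three a's, then backtracks by one
// so that the literal "ab" after it can still match.
$greedy = preg_match('/a+ab/', 'aaab');

// Possessive ++: "a++" keeps every a it matched and never gives one back,
// so the literal "a" that follows can never match.
$possessive = preg_match('/a++ab/', 'aaab');

// An atomic group is the long-hand spelling of the same behaviour.
$atomic = preg_match('/(?>a+)ab/', 'aaab');

var_dump($greedy, $possessive, $atomic); // int(1), int(0), int(0)
```

In the pattern from the question, as far as I can tell, giving characters back could never help: right after the inner group only {{, }} or another character of the same kind can follow. That would explain why + and ++ return the same matches there, and ++ merely saves the engine some pointless backtracking.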

As for your second question: yes, (?R) will simply try to match the full pattern again. There is a good article on this in the PHP documentation of PCRE.
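A small sketch of that recursion, using the first pattern from the question with the comments stripped. The sample string and variable names are made up for illustration only.

```php
<?php
// Same structure as the first pattern in the question, written on one line.
// (?R) re-enters the whole pattern, so a nested {{ ... }} is consumed
// as part of the template that contains it.
$pattern = '/\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}/s';

$text = 'a {{x {{y}} z}} b {{w}}';
preg_match_all($pattern, $text, $matches);

print_r($matches[0]); // two matches: "{{x {{y}} z}}" and "{{w}}"
```

The inner {{y}} is not listed on its own because preg_match_all continues searching after the end of the previous match. Since your whole pattern is just the {{ … }} construct, "the full pattern" and "the outermost level" happen to be the same thing here.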

For your other questions, Code Review might be a better place to ask.
