I know that parsing nested strings or HTML is better done by a real parser, but in my case I have simple templates and wanted to extract the content of the wiki parameter 'title' from a template. It took me a while, but thanks to the regex tool of Lars Olav Torvik (http://regex.larsolavtorvik.com/) and this user forum I got there. Maybe someone finds it useful. (We all want to contribute, don't we? ;-) The following code, annotated with comments, does the trick. I had to use lookaround assertions so that two templates do not get mixed together when one of them has no title.
I'm not yet sure about the two questions in the regex comments (see (?# Questions: …)). First, did I understand the recursive part at (?R) correctly? Is it right that it takes the pattern to check against from the outermost defined level, i.e. from the second regexp line \{\{ to the last regexp line \}\}? Second, what is the difference between ++ and + before the alternative with (?R)? Both seem to work equally when tested.
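To make the "both work equally" observation reproducible, here is a minimal sketch that runs the balanced-brace pattern once with ++ (the possessive quantifier in PCRE) and once with + (the ordinary greedy one) and compares the matches. The names $sample, $patternFormat and $results are only for illustration, and the line breaks in the sample are an assumption about the layout of the templates. Identical results on this one input do not show that the two quantifiers are equivalent in general.

```php
<?php
// Sample input with explicit \n line breaks (assumed layout of the wiki templates).
$sample = "\n{{Templ1\n| title = (1. template) title\n}}\n"
        . "{{Templ2\n| any parameter = something {{template}}\n}}\n"
        . "{{Templ1\n| title = (3. template) title\n}}\n";

// Balanced {{ … }} pattern from the question; only the quantifier differs.
// %s is filled in with '++' (possessive) or '+' (greedy).
$patternFormat = '@(?s) \{\{ (?: (?: (?! \{\{ | \}\} ) . )%s | (?R) )* \}\} @x';

$results = array();
foreach (array('possessive' => '++', 'greedy' => '+') as $name => $quantifier) {
    $pattern = sprintf($patternFormat, $quantifier);
    preg_match_all($pattern, $sample, $matches);
    $results[$name] = $matches[0]; // full matches, i.e. the top-level templates
}

// Compare the two result lists and report how many templates were found.
var_dump($results['possessive'] === $results['greedy']);
echo count($results['possessive']), " top-level templates found\n";
```

The nested {{template}} inside Templ2 should not show up as a match of its own, because it is consumed by the (?R) branch while the outer template is being matched.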
The original wiki templates on a page (simplest case):
$wikiTemplate = "
{{Templ1
| title = (1. template) title
}}
{{Templ2
| any parameter = something {{template}}
}}
{{Templ1
| title = (3. template) title
}}
";
The replacement:
$wikiTemplate = preg_replace(
  array(
    // tag all templates with START … END and add a TITLE-placeholder before
    // and take care of balanced {{ … }} recursiveness
    "@(?s) (?# switch to dotall match, i.e. also linebreaks )
      \{\{ (?# find two {{ )
      (?: (?# group 1 as a non-backreferenced match )
        (?: (?# group 2 as a non-backreferenced match )
          (?! (?# in group 1 anything but not {{ or }} )
            \{\{ | (?# or ) \}\}
          )
          .
        )++ (?# Question: what is the difference between ++ and + here? )
        | (?# or )
        (?R) (?# is it recursive of what is defined in the outermost, i.e. 2nd regexp line with \{\{ and last line with \}\} Question: is that here understood correctly? )
      ) * (?# zero or many times of the inner regexp definitions )
      \}\} (?# find two }} )
    @x", // x-extended → ignore white space in the pattern
    // replace TITLE by single line content of title parameter
    "@
      (?<=TITLE) (?# TITLE must precede the following linebreak but is not backreferenced within \\0, i.e. the whole returned match)
      ([\n\r]+) (?#linebr in 1 may also described as . because of s-modifier dotall)
      (?: (?# start non-backreferenced match )
        . (?# any character but not followed by START)
        (?!START)
      )+ (?# multiple times)
      (?: (?# start non-backreferenced match )
        \|\s*title\s*=\s* (?#find the parameter '| title = ')
      )
      ([^\r\n]+) (?#get title now to \\2 but exclude the line break. Note it is buggy when there is no line break )
      (?: (?# start non-backreferenced match )
        . (?# any character but not followed by END)
        (?!END)
      ) + (?# multiple times)
      . (?# any single character, e.g. the last because as all stuff before captures anything not followed by END)
      (?:END) (?#a not backreferenced END)
    @msx", // m-multiline, s-dotall match also linebreaks,
           // x-extended → ignore white space in the pattern
  ),
  array(
    "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
    # replace the TITLE to TITLEtitle contentTITLE…
    "\\2TITLE\\0",
  ),
  $wikiTemplate
);
print_r($wikiTemplate);
The output then has the title tagged by TITLE above each template, but only if the template had a title:
TITLE(1. template) titleTITLE
START{{Templ1
| title = (1. template) title
}}END
TITLE
START{{Templ2
| any parameter = something {{template}}
}}END
TITLE(3. template) titleTITLE
START{{Templ1
| title = (3. template) title
}}END
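Since the actual goal is to extract the titles, here is a small sketch of how they could be read back out of the tagged output. It assumes that each TITLE…TITLE marker sits on a line of its own and that lines end in \n. The names $tagged, $found and $titles are hypothetical and not part of the code above.

```php
<?php
// Tagged output with explicit \n line breaks (assumed layout).
$tagged = "TITLE(1. template) titleTITLE\n"
        . "START{{Templ1\n| title = (1. template) title\n}}END\n"
        . "TITLE\n"
        . "START{{Templ2\n| any parameter = something {{template}}\n}}END\n"
        . "TITLE(3. template) titleTITLE\n"
        . "START{{Templ1\n| title = (3. template) title\n}}END\n";

// A line of the form TITLE<text>TITLE carries a title. A bare TITLE line
// (template without a title) does not match, because .+ needs at least
// one character between the two markers.
preg_match_all('/^TITLE(.+)TITLE$/m', $tagged, $found);
$titles = $found[1];

print_r($titles);
```

The m modifier makes ^ and $ match at line boundaries, so each marker line is checked separately. Templates without a title contribute nothing to $titles.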
Any insight into my questions about understanding the regexp, or any improvements? Thanks, Andreas.