I have a basic text template engine that uses a syntax like this:

foo bar
%IF MY_VAR
  some text
  %IF OTHER_VAR
    some other text
  %ENDIF
%ENDIF
bar foo

The regular expression I am using to parse it does not take nested IF/ENDIF blocks into account.

The current regex I'm using is: %IF (?<Name>[\w_]+)(?<Contents>.*?)%ENDIF

I have been reading up on balancing capture groups (a feature of .NET's regex library), as I understand this is the recommended way of supporting "recursive" regexes in .NET.

I've been playing with balancing groups and have so far come up with the following:

(
 (
  (?'Open'%IF\s(?<Name>[\w_]+))
  (?<Contents>.*?)
 )+
 (
  (?'Close-Open'%ENDIF)(?<Remainder>.*?)
 )+
)*
(?(Open)(?!))

But this is not behaving entirely how I would expect. It is for instance capturing a lot of empty groups. Help?

nbevans

1 Answer

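To capture a whole IF/ENDIF block with balanced IF statements, you can use this regex (compile it with RegexOptions.IgnorePatternWhitespace, since it is spread over several lines and contains # comments):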

%IF\s+(?<Name>\w+)
(?<Contents>
    (?> # atomic group, so . cannot backtrack to swallow an IF/ENDIF
        \s|
        (?<IF>%IF)|     # for IF, push
        (?<-IF>%ENDIF)| # for ENDIF, pop
        . # or anything else, but don't allow backtracking into it
    )+
    (?(IF)(?!)) # fail on extra open IFs
)   # /Contents
%ENDIF

The point here is this: within a single Match, the value of each named group is only its last capture. You will get just one (?<Name>\w+) value, for example. In my regex, I kept the Name and Contents groups of your simple regex and limited the balancing to the inside of the Contents group, so the regex is still wrapped in %IF and %ENDIF.

It becomes interesting when your data is more complex. For example:

%IF MY_VAR             
  some text
  %IF OTHER_VAR
    some other text
  %ENDIF
  %IF OTHER_VAR2
    some other text 2
  %ENDIF
%ENDIF                 
%IF OTHER_VAR3         
    some other text 3
%ENDIF                 

Here, you will get two matches, one for MY_VAR and one for OTHER_VAR3. To capture the two IFs inside MY_VAR's content, you have to rerun the regex on its Contents group. (If you must, you can get around that with a lookahead, by wrapping the whole regex in (?=...), but you'll then need to arrange the overlapping matches into a logical structure yourself, using their positions and lengths.)
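
The "match, then rerun on Contents" idea can be sketched as a small recursion. Python's re module has no balancing groups, so this sketch swaps in a depth counter for the .NET regex: it only imitates the push/pop behaviour on well-formed input and is not the regex itself. The names top_level_blocks and build_tree are made up for illustration.

```python
import re

# "%ENDIF" never matches the first alternative, because "%E" is not "%I".
TOKEN = re.compile(r"%IF\s+(\w+)|%ENDIF")

def top_level_blocks(text):
    """Yield (name, contents) for each balanced top-level %IF...%ENDIF block."""
    depth = 0
    name = None
    start = 0
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:       # %IF NAME: the "push"
            if depth == 0:
                name, start = m.group(1), m.end()
            depth += 1
        elif depth > 0:                  # %ENDIF: the "pop"
            depth -= 1
            if depth == 0:
                yield name, text[start:m.start()]
        # a stray %ENDIF at depth 0 is silently skipped in this sketch

def build_tree(text):
    """Rerun the block finder on each block's contents, as described above."""
    return [
        {"name": name, "children": build_tree(contents)}
        for name, contents in top_level_blocks(text)
    ]
```

Calling build_tree on the example above gives two top-level entries, MY_VAR (with OTHER_VAR and OTHER_VAR2 as children) and OTHER_VAR3. In .NET the same recursion would call the regex again on match.Groups["Contents"].Value instead of using a counter.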

Now, I won't explain too much, because it seems you get the basics, but a short note about the Contents group: I've used an atomic (non-backtracking) group, (?> ), to avoid backtracking. Otherwise, the dot could eventually match the characters of a whole %IF or %ENDIF and break the balance. A lazy match on the group would behave similarly (( )+? instead of (?> )+).

Kobi
  • All that aside, consider using a parser, it should take care of it easily. – Kobi Nov 26 '10 at 15:09
  • That's brilliant. Thank you very much. I've added some recursion into my program so that it traverses the nested if/endif blocks. – nbevans Nov 26 '10 at 16:04
  • Minor issue that's hopefully easy to fix... consider this input text: `%IF MY_VAR some text %IF OTHER_VAR %ENDIF`. Notice that the inner OTHER_VAR is not closed with an ENDIF. However the regex matches the inner block rather than matching on the outer block. How can I make the regex, in this particular scenario (i.e. malformed template), match on the outer-most block rather than jumping straight to the inner block? – nbevans Nov 26 '10 at 16:44
  • @NathanE - Good question! I was going to add a paragraph about invalid input, but thought it was too long as it is... I'm not sure there's an easy way to solve that problem - the balancing groups are there to *avoid matching* in case of mismatched groups, so it's hard to add a rule for the "right" behavior in that case (for that matter, this regex captures a balanced IF, which is "right"). You might fall back on your original regex. If you want to define clear rules, you might need to write a grammar and parser - that allows fine tuning, and can provide better parsing error messages. – Kobi Nov 26 '10 at 17:36
  • No problem. It is still a great regex you came up with Kobi. If the need arises in the future to have a more graceful error handling condition for a malformed template, or maybe to add more types of tokens to the template language, then I will very likely just redesign it to be a proper parsing engine. I think the regex will do just fine until that day though! – nbevans Nov 26 '10 at 17:44
  • Looking again at the question, it is probably possible to capture all IFs in a single Match, using Groups and Captures. Maybe I'll update the answer if needed, I didn't know that at the time... – Kobi Jun 16 '11 at 17:18
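
For the parser route suggested in the comments, a minimal sketch might look like the following. It is written in Python rather than .NET and is only one possible shape, not the approach either poster actually used. The names parse_template and TemplateError are hypothetical, and the rule that an %ENDIF always closes the innermost open %IF is an assumption about the template language.

```python
import re

TOKEN = re.compile(r"%IF\s+(\w+)|%ENDIF")

class TemplateError(ValueError):
    """Raised for unbalanced %IF/%ENDIF (hypothetical name)."""

def parse_template(text):
    """Return a tree of plain strings and ("IF", name, children) tuples."""
    root = []
    # Each stack entry: (name, line number of its %IF, list of children).
    stack = [(None, 0, root)]
    pos = 0
    for m in TOKEN.finditer(text):
        children = stack[-1][2]
        if m.start() > pos:                      # literal text before the token
            children.append(text[pos:m.start()])
        line = text.count("\n", 0, m.start()) + 1
        if m.group(1) is not None:               # %IF NAME: open a new block
            node = ("IF", m.group(1), [])
            children.append(node)
            stack.append((m.group(1), line, node[2]))
        else:                                    # %ENDIF: close innermost block
            if len(stack) == 1:
                raise TemplateError(f"line {line}: %ENDIF without a matching %IF")
            stack.pop()
        pos = m.end()
    if len(stack) > 1:                           # something was left open
        name, line, _ = stack[-1]
        raise TemplateError(f"line {line}: %IF {name} is never closed")
    if pos < len(text):                          # trailing literal text
        root.append(text[pos:])
    return root
```

On the malformed input from the comment above, `%IF MY_VAR some text %IF OTHER_VAR %ENDIF`, this sketch raises a TemplateError naming MY_VAR: the single %ENDIF is attributed to the innermost block, so the outer one is reported as unclosed instead of the inner block being silently matched.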