-2

This is what I have now:

/{% if(.+?) %}(.*?){% endif %}/gusi

It catches multiple if statements etc just fine.

IMG: http://image.xesau.eu/2015-02-07_23-22-11.png

But when I nest them, so an if inside an if, it stops at the first occurrence of {% endif %}

IMG: http://image.xesau.eu/2015-02-08_09-29-43.png

Is there a way to catch as many {% endif %} statements as there were {% if ... %} statements, and if so, how?

Xesau
  • What language are you using? – Maroun Feb 08 '15 at 08:34
  • PHP Regular Expressions. (preg_ functions) I added that in the tags, but someone (@Michael9) said I'd better remove it. Will revert it. – Xesau Feb 08 '15 at 08:35
  • You *really* shouldn't be using regexen for this; a parser will do a much better and more precise job. This looks like Twig syntax; if that's the case, Twig has a fantastic parser which you can hijack/extend/appropriate for this task. – deceze Feb 08 '15 at 08:39
  • @deceze That's the exact purpose of it. I am trying to write a configurable template parser, which you can set up to read variables like, for example, {{ var }} or {$var} or however you want it. I have never tried using blocks and such, though; it seems too complicated for me xD – Xesau Feb 08 '15 at 08:40
  • Are you trying to match outermost `{% if(.+?) %}(.*?){% endif %}` or innermost one? – anubhava Feb 08 '15 at 08:46
  • Neither. I am trying to catch the nth `{% endif %}`, where `n` is the number of `{% if ... %}`s – Xesau Feb 08 '15 at 08:47
  • It also looks similar to Smarty template syntax. If it's an existing template language, use the provided/native library/extension. If you're trying to roll your own template language, it's best to stop now, before you run into the numerous further issues you'll encounter, and use an existing template engine. Templating requires extensive focus and testing on possible input streams, and needs a very well-defined lexer/parser/tokenizer ruleset beyond a handful of simple regular expressions. – Anthony Feb 08 '15 at 09:03
  • @Anthony http://stackoverflow.com/questions/28392033/how-to-catch-nested-if-endif-statments-with-regex?noredirect=1#comment45121411_28392200 – Xesau Feb 08 '15 at 09:07
  • @anubhava Just re-reading your question. I misinterpreted it the first time. I am trying to match the outermost, yes. – Xesau Feb 08 '15 at 09:08

2 Answers

5

Don't use regexen, use the existing Twig parser. Here's a sample of an extractor I wrote which parses for custom tags and extracts them: https://github.com/deceze/Twig-extensions/tree/master/lib/Twig/Extensions/Extension/Gettext

The job of the lexer is to turn Twig source code into a stream of token objects; you can extend it if you need to hook into that process:

class My_Twig_Lexer extends Twig_Lexer {

    ...

    /**
     * Overrides lexComment by saving comment tokens into $this->commentTokens
     * instead of just ignoring them.
     */
    protected function lexComment() {
        if (!preg_match($this->regexes['lex_comment'], $this->code, $match, PREG_OFFSET_CAPTURE, $this->cursor)) {
            throw new Twig_Error_Syntax('Unclosed comment', $this->lineno, $this->filename);
        }
        $value = substr($this->code, $this->cursor, $match[0][1] - $this->cursor);
        $token = new Twig_Extensions_Extension_Gettext_Token(Twig_Extensions_Extension_Gettext_Token::COMMENT, $value, $this->lineno);
        $this->commentTokens[] = $token;
        $this->moveCursor($value . $match[0][0]);
    }

    ...

}

Normally Twig discards comments; this lexer saves them as tokens instead.

However, your main concern will be to work with the parser:

$twig   = new Twig_Environment(new Twig_Loader_String);
$lexer  = new My_Twig_Lexer($twig);
$parser = new Twig_Parser($twig);

$source = file_get_contents($file);
$tokens = $lexer->tokenize($source);
$node   = $parser->parse($tokens);
processNode($node);

$node here is the root node of a tree of nodes which represents the Twig source in an object-oriented fashion, all correctly parsed already. You just need to process this tree, without having to worry about the exact syntax which was used to produce it:

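function processNode(Twig_NodeInterface $node) {
    // Dispatch on the node type; processFunctionNode() and
    // processFilterNode() are your own handlers.
    switch (true) {
        case $node instanceof Twig_Node_Expression_Function:
            processFunctionNode($node);
            break;
        case $node instanceof Twig_Node_Expression_Filter:
            processFilterNode($node);
            break;
    }

    // Recurse into the child nodes to walk the whole tree.
    foreach ($node as $child) {
        if ($child instanceof Twig_NodeInterface) {
            processNode($child);
        }
    }
}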

Just traverse it until you find the kind of node you're looking for and get its information. Play around with it a bit. This example code may be a bit outdated; you'll have to dig into the Twig parser source code anyway to understand it.

deceze
  • I'm sorry, but I can't accept two answers as best, so I took the most straightforward one. Your explanation is a really good one, but I was looking for a working recursive regex pattern, also just to learn how they work. Nevertheless thanks for your Twig thingy (dunno how to call it) – Xesau Feb 08 '15 at 08:58
  • You'll be happy with a regex solution until it'll almost inevitably break on some special case one day. Or until you need to extend it. Just saying... :o) – deceze Feb 08 '15 at 08:59
  • Yes, I know. Here's why I want to use it nonetheless: I often create small PHP projects. Because it's a pain to understand what's going on when PHP and HTML are flowing through each other, I always use a template engine. Because most template engines are too complex for simple projects, I decided to just make my own that does simple things like if-then-else and loops. It's not meant for anything complex. For complex projects I already use Twig, which is also why I chose that syntax. Saves me designing one myself. – Xesau Feb 08 '15 at 09:06
  • There are a ton of complex and simple template languages already out there. Take a look at Mustache, it hardly gets any simpler than that. Twig is also extremely simple if you just use its simple parts, but can be powerful if you need it to. It sounds like a Very Bad Idea™ to invent your own language **based on regexen**. They're simply not the tool for the job. I appreciate that lexers and parsers may be an entirely foreign thing, but they're really the way to go. But suit yourself... :P – deceze Feb 08 '15 at 09:08
  • @aronvanwillige - the trick to avoiding the PHP/HTML flow issue is to not interweave the two. A good template engine should be less painful than trying to create and parse your own. – Anthony Feb 08 '15 at 09:12
  • @deceze I know you are right. I have known that ever since I posted this question. What I hate, though, is that my projects consist of 90% template engine and 10% own code. As I already said, I use Twig for big projects. This is just a qad-solution to prevent me from making php functions() to render the page, because it just *is* ugly. – Xesau Feb 08 '15 at 09:14
  • @Anthony Good template engines are not a pain. I have worked with Rain (2 and 3), Smarty, Twig. All of them worked really well for me, but it's just to avoid PHP and HTML flowing through each other. It is not meant for any public projects or whatsoever. – Xesau Feb 08 '15 at 09:16
  • Well, made this as accepted answer :) Although it doesn't answer my question, you seemed to really like it so... – Xesau Feb 08 '15 at 09:19
3

It is almost trivial to change your pattern into a recursive pattern:

{% if(.+?) %}((?>(?R)|.)*?){% endif %}

Working example: https://regex101.com/r/gX8rM0/1
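For PHP specifically, here is a minimal sketch of how the recursive pattern above could be applied with `preg_match_all`. The function name `findOutermostIfBlocks` and the returned array keys are hypothetical, chosen only for this illustration. The `g` flag from the question is dropped because PHP's preg functions do not accept it and `preg_match_all` already returns every match; the `u`, `s` and `i` modifiers are carried over from the question.

```php
<?php
/**
 * Hypothetical helper: returns the outermost {% if %}...{% endif %} blocks.
 * Nested blocks stay inside the 'body' of their parent.
 */
function findOutermostIfBlocks(string $template): array
{
    // (?R) recurses into the whole pattern, so an inner if/endif pair is
    // consumed as a unit before the outer {% endif %} is looked for.
    $pattern = '/{% if(.+?) %}((?>(?R)|.)*?){% endif %}/usi';

    preg_match_all($pattern, $template, $matches, PREG_SET_ORDER);

    $blocks = [];
    foreach ($matches as $m) {
        // $m[1] is the condition text, $m[2] the body of the block.
        $blocks[] = ['condition' => trim($m[1]), 'body' => $m[2]];
    }
    return $blocks;
}

// Usage: one nested block followed by a flat one.
$template = '{% if a %}A{% if b %}B{% endif %}C{% endif %} and {% if c %}D{% endif %}';
foreach (findOutermostIfBlocks($template) as $block) {
    echo $block['condition'], ' => ', $block['body'], PHP_EOL;
}
```

To process the nested blocks as well, the same function could be called again on each `body`.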

However, that would be a bad idea: the pattern is missing many cases, which are really bugs in your parser. Just a few common examples:

  • Comments:

    {% if aaa %}
    123
    <!-- {% endif %} -->
    {% endif %}
    
  • String literals:

    {% if aaa %}a = "{% endif %}"{% endif %}
    
    {% if $x == "{% %}" %}...{% endif %}
    
  • Escaped characters (you do need escaped characters, right?):

    <p>To start a condition, use <code>\{% if aaa %}</code></p>
    
  • Invalid input:
    It would be nice if the parser can work relatively well on invalid input, and point to the correct position of the error.
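    To illustrate the last point, here is a rough sketch of a stack-based scanner that pairs each `{% if ... %}` with its `{% endif %}` and reports the offset of a tag that has no partner. It is not a real lexer: the name `pairIfBlocks` and the shape of the returned array are assumptions made for this example, and it still ignores comments, string literals and escapes, so it shares those gaps with the pattern above.

```php
<?php
/**
 * Hypothetical helper: pairs {% if ... %} tags with {% endif %} tags.
 * Returns byte offsets of each pair plus its nesting depth (0 = outermost).
 * Throws RuntimeException when a tag has no partner.
 */
function pairIfBlocks(string $template): array
{
    // Tokenize: collect every opening or closing tag together with its offset.
    preg_match_all(
        '/{%\s*(if\b.*?|endif)\s*%}/s',
        $template,
        $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $stack = []; // offsets of {% if %} tags still waiting for their endif
    $pairs = [];

    foreach ($tags as $tag) {
        $offset = $tag[0][1];
        if ($tag[1][0] === 'endif') {
            if ($stack === []) {
                throw new RuntimeException('Unexpected {% endif %} at offset ' . $offset);
            }
            $open = array_pop($stack);
            $pairs[] = ['open' => $open, 'close' => $offset, 'depth' => count($stack)];
        } else {
            $stack[] = $offset;
        }
    }

    if ($stack !== []) {
        throw new RuntimeException('Unclosed {% if %} at offset ' . end($stack));
    }
    return $pairs;
}
```

    Inner pairs are emitted before the pair that encloses them, because a pair is only recorded once its `{% endif %}` has been seen.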

Kobi
  • This is a really good piece of code, Kobi. I found solutions for some of the problems: 1. Comments: these can be filtered out before any of the magic happens, with `preg_replace ('/<\!--(.*?)-->/usi', '', $context);` I suppose? 2. Escaped characters can be changed into HTML entities beforehand, using `preg_replace_callback`. 3. String literals: same as escaped characters. Your example does not make much sense, though, since logic is used in the (HTML) output of the condition. – Xesau Feb 08 '15 at 09:02
  • Any suggestions for the OP on either a solid templating framework or, if they insist on a homegrown solution, a well-defined group of tools for implementing a custom lexicon for a lexer/parser/tokenizer? I just figure it's better to point to lower-level tools than have them think they need to see your answer as a challenge/obstacle. – Anthony Feb 08 '15 at 09:07
  • @Anthony - My answer is 159 characters, followed by 605 characters of caveats (and those are just *examples*). I thought that would be enough, especially with Deceze's answer. – Kobi Feb 08 '15 at 09:12
  • Fair enough. I didn't expect the chatter to be so lively, and was thinking OP might need some direction. – Anthony Feb 08 '15 at 09:15