2

I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.

It seems like there's no way of 'finding' comments using the PHP DOM.

I'm using regex to do some of the processing already, however I am very unfamiliar with (have yet to grasp or truly understand) look aheads and look behinds in regex.

For instance, I may have the following code:

<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>

I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.

How would this usually be done in RegEx?

Here's what I've understood about negative lookarounds so far, and my attempt at one. I'm clearly doing something very wrong!

(?<!<pre>.*?)<!--.*-->(?!.*?</pre>)

Joel
  • [You could use](http://stackoverflow.com/q/11977896/1633117) the [PHP Simple HTML DOM Parser](http://simplehtmldom.sourceforge.net/) instead. – Martin Ender Aug 16 '13 at 09:07
  • Or [one of the other countless alternatives](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php). – Martin Ender Aug 16 '13 at 09:12
  • Neither of these "links" answer the question. I will put some things I have attempted to make the question more specific. I would rather not use an external library if possible. – Joel Aug 16 '13 at 09:27
  • Do you control the input HTML directly, so that you can ensure that there is no JavaScript or comments containing `<pre>`, no CDATA blocks, and no nested comments or `<pre>` blocks? If you cannot ensure this, there is probably no sensible solution using regex. If you can, I'll try to give one =) – Jens Aug 16 '13 at 09:33
  • Yeah, there should really only be `<pre>` tags containing HTML, and no other code containing `<pre>`. – Joel Aug 16 '13 at 09:34
  • @Joel the problem with regex is, PCRE does not support lookbehinds of variable length. So while your attempt is actually pretty sound (except for some greediness problems), it would only work in .NET. This is why it's near impossible to solve this robustly with regex. – Martin Ender Aug 16 '13 at 09:37
  • I see, so does the lookbehind mean "directly preceded by"? – Joel Aug 16 '13 at 09:39
  • @Joel, yes, in any case. Which is why using `.*?` to allow for the `<pre>` to be anywhere left of that position is correct. But unfortunately, most regex engines require lookbehinds to have a fixed length (which is violated by the use of `*`). – Martin Ender Aug 16 '13 at 09:44
  • Thank you :) That's good to know. I'm considering replacing ALL comment tags with `<pre>` (or a temporary tag), then removing any of those tags found within `<pre>` tags. :P Not as nice as a regex, but it works. – Joel Aug 16 '13 at 09:57
  • When I say 'replacing' I mean surrounding. Then removing any that were already inside `<pre>` tags. – Joel Aug 16 '13 at 10:14
  • @m.buettner: Actually, look-behinds are not needed for this; look-aheads are sufficient and they may be of variable length. See my answer. – Jens Aug 16 '13 at 12:58
  • @Jens sure, if we're assuming valid HTML and that there are no other `pre`s inside comments or CDATA (or the attributes you mentioned, of course). – Martin Ender Aug 16 '13 at 13:02

4 Answers

2

You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.

Having said that, here's what you could (but should not, see above) do:

First, identify comments, e.g. using

<!-- (?:(?!-->).)*-->

The negative look-ahead block ensures that the .* does not run out of the comment block.

Now, you need to figure out whether this comment is inside a <pre> block. The key observation is that a comment NOT already inside a <pre> block is followed by an even number of <pre> and </pre> tags in total, whereas a comment inside one is followed by an odd number (the closing </pre> of its own block is unpaired).

So, run through the rest of your text, always in pairs of <pre>s, and check if you arrive at the end.

This would look like

(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

So, together this would be

<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

A hurray for write-only code =)

The prominent building block of this expression is (?:(?!</?pre>).) which matches every character that is not the starting bracket of a <pre> or </pre> sequence.

Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader. See this in action at RegExr.
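
As a rough sketch of how the combined expression might be applied in PHP: the function name `wrapCommentsOutsidePre` is made up for illustration, and the `s` modifier is an addition so that `.` also matches newlines. It assumes well-formed input with plain, non-nested `<pre>` tags, exactly as the answer does.

```php
<?php
// Hypothetical helper; the pattern itself is the one built up above.
function wrapCommentsOutsidePre(string $html): string
{
    $tag    = '</?pre>';
    // Run of characters that never starts a <pre> or </pre> tag.
    $notTag = "(?:(?!$tag).)*";
    // Comment, followed by an even number of pre tags up to the end of input.
    $pattern = "~<!-- (?:(?!-->).)*-->(?=$notTag(?:$tag$notTag$tag$notTag)*\$)~s";

    $result = preg_replace($pattern, '<pre>$0</pre>', $html);

    // preg_replace() returns null on error (e.g. backtrack limit); fall back to the input.
    return $result ?? $html;
}

$sample = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";
echo wrapCommentsOutsidePre($sample);
```

With the sample from the question, only Comment 1 should end up wrapped. On large documents the nested lookaheads can become slow or hit PCRE's backtracking limits, which is one more reason to prefer a parser.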

Jens
  • Despite the fact you suggest not actually using this method, I will mark this as the answer as it succinctly answers my question re: regex, and is useful. Thank you! :) – Joel Aug 16 '13 at 13:17
1

It seems like there's no way of 'finding' comments using the PHP DOM.

Of course you can... Check this code using PHP Simple HTML DOM Parser:

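<?php
// Simple HTML DOM is an external library; adjust this path to wherever simple_html_dom.php is installed.
require_once 'simple_html_dom.php';
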
$text = '<!-- Comment 1 -->

        <pre>
            <div class="some_html"></div>
            <!-- Comment 2 -->
        </pre>';

echo  "<div>Original Text: <xmp>$text</xmp></div>";

$html = str_get_html($text);

$comments = $html->find('comment');

// Only report results if find() returned any comment nodes
if ($comments) {

  echo '<br>Find function found ' . count($comments) . ' results: ';

  foreach ($comments as $key => $com) {
    echo '<br>' . $key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
  }
} else {
  echo "Find() fails!";
}
?>

$com->innertext will give you the comments like <!-- Comment 1 -->...

Now you just have to clean them up as you wish, for example using <!--\s*(.*)\s*-->. Try it HERE

Edit:

Just a note concerning the lookbehind: it MUST have a fixed width, so you cannot use repetition (* or +) or optional items (?).

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Source: http://www.regular-expressions.info/lookaround.html
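
To see the restriction in PHP's PCRE, a small check along these lines could be used. This is only a sketch; the `@` operator is there just to silence the compilation warning so the two return values can be compared.

```php
<?php
// A fixed-width lookbehind compiles and matches.
$fixed = @preg_match('~(?<=<pre>)<!--~', '<pre><!-- x -->');

// An unbounded lookbehind (the .*? from the question) is rejected when the
// pattern is compiled, so preg_match() returns false instead of 0 or 1.
$unbounded = @preg_match('~(?<!<pre>.*?)<!--~s', '<!-- x -->');

var_dump($fixed);      // int(1)
var_dump($unbounded);  // bool(false)
```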

Enissay
  • Using the simple html dom external library, yes. Not using the native PHP DOM class – Joel Aug 16 '13 at 10:06
0

XPath is your friend:

// $html is assumed to hold the markup to process
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach($xpath->query('//comment()[not(ancestor::pre)]') as $comment){
  $pre = $doc->createElement("pre");
  $comment->parentNode->insertBefore($pre, $comment);
  $pre->appendChild($comment);
}
pguardiario
0

It's quite easy using a principle called the stack counter.
Essentially, you count the number of <pre> tags and the number of </pre> tags up to the point in the HTML where your segment is placed.
If there are more <pre> than </pre>, you are inside a block: "<pre>..--you are here--..</pre>".
In that case, simply return the match unmodified.
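
The counting idea could be sketched in PHP roughly as follows. The function name `wrapCommentsByCounting` and the single-pass pattern that matches tags and comments in document order are illustrative choices, not something taken from the answer. It assumes `<pre>` tags do not occur inside scripts or CDATA sections.

```php
<?php
// Hypothetical helper: walks <pre>, </pre> and comments in order, keeping a counter.
function wrapCommentsByCounting(string $html): string
{
    $depth = 0; // number of <pre> blocks currently open

    $result = preg_replace_callback(
        '~<pre\b[^>]*>|</pre>|<!--.*?-->~is',
        function (array $m) use (&$depth): string {
            $token = $m[0];
            if (strncmp($token, '<!--', 4) === 0) {
                // Comment: leave it alone inside <pre>, wrap it otherwise.
                return $depth > 0 ? $token : '<pre>' . $token . '</pre>';
            }
            if ($token[1] === '/') {
                $depth = max(0, $depth - 1); // closing </pre>
            } else {
                $depth++;                    // opening <pre ...>
            }
            return $token;
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

$sample = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";
echo wrapCommentsByCounting($sample);
```

Because a comment is consumed as a single token, a `<pre>` written inside a comment does not disturb the counter, and the pattern also tolerates attributes on the opening tag.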