How can I convert the code inside the <code> and <pre> tags to HTML entities?

<code class="php"> <div> a div.. </div> </code>

<pre class="php">
<div> a div.. </div>
</pre>

<div> this should be ignored </div>
Lightness Races in Orbit
Alex
  • Depends on the context. Where does the code reside? Inside a string? – Lightness Races in Orbit Apr 02 '11 at 23:04
  • yes, it's php string variable – Alex Apr 02 '11 at 23:05
  • @Alexandra this is tough, because you'd need to parse the structure first to tell apart the parts you need to entity-encode from those you don't. Why is it mixed that way in the first place? Can you influence how this is generated? – Pekka Apr 02 '11 at 23:07
  • i can't.. this is the output when a visitor posts a comment and I want to be able for them to post html too – Alex Apr 02 '11 at 23:09
  • @Alexandra You can't just let visitors post HTML to your site — this enables XSS attacks and allows bots to post really nasty spam that is invisible to regular visitors, but visible to search engine bots. – Kornel Apr 02 '11 at 23:16
  • I am sure it is HE.. sure he can, he will change it once he gets hacked. – Dejan Marjanović Apr 02 '11 at 23:57
  • @webarto: I'm not a "he". how can I be hacked if the code is converted to entities, and after that I strip tags from the entire comment (except a few like ``, `` etc.)? – Alex Apr 03 '11 at 00:15
  • After you finish, feel free to post a link. You can use `` and then convert it to `` tag, it is shorter. Apologies for "he", not my business. – Dejan Marjanović Apr 03 '11 at 00:17
  • @Alexandra: The problem is that you are listing tags that _cannot_ be used, rather than listing tags that _can_ be used. – Lightness Races in Orbit Apr 03 '11 at 01:20
  • @Tomalak, but I'm running a strip tag function on the entire comment (after the htmlentity thing) that will remove any tags that are not in a $allowed variable (which are only a few) – Alex Apr 03 '11 at 02:05
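
The two-step pipeline described in these comments (entity-encode the code blocks, then strip every tag that is not explicitly allowed) might look roughly like the sketch below. The contents of `$allowed` and the sample comment are assumptions made for illustration; the thread never shows the real values. Note that `strip_tags` keeps the attributes of allowed tags, which is one reason the commenters are wary of this approach.

```php
<?php
// Hypothetical allowlist; the real $allowed value is not shown in the thread.
$allowed = '<code><pre><b><i>';

// Hypothetical visitor comment used only for this sketch.
$comment = '<code class="php"><div>x</div></code> <script>alert(1)</script> <b onclick="evil()">hi</b>';

// Step 1: entity-encode whatever sits inside <code class="php"> or <pre class="php">.
// \2 forces the closing tag to match the opening one.
$escaped = preg_replace_callback(
    '#(<(code|pre) class="php">)(.*?)(</\2>)#is',
    function ($m) {
        return $m[1] . htmlentities($m[3], ENT_QUOTES, 'UTF-8') . $m[4];
    },
    $comment
);

// Step 2: drop every tag that is not on the allowlist.
// Attributes on allowed tags (such as onclick) are NOT removed by strip_tags.
$out = strip_tags($escaped, $allowed);

echo $out;
```

In this sketch the `<script>` tags are removed, but the `onclick` attribute on the allowed `<b>` tag is left in place, so an allowlist of tag names alone is not sufficient protection.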

4 Answers


You can use jQuery. This will encode everything inside any element that has the class `code`.

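// For each element with class "code", replace its markup with the same
// markup as plain text; .text() entity-encodes it on assignment.
$(".code").each(function () {
    $(this).text($(this).html());
});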

The fiddle: http://jsfiddle.net/mazzzzz/qnbLL/

Jess
  • +1 I'd recommend this approach as long as the result HTML is not insecure. – Christian Apr 02 '11 at 23:36
  • This question is about PHP, not Javascript. – Lightness Races in Orbit Apr 02 '11 at 23:39
  • but how can you get hacked if the code is escaped? stackoverflow does the same thing.... – Alex Apr 03 '11 at 00:10
  • @Alexandra: No, it doesn't. Stack Overflow accepts a strict subset of HTML. You _reject_ a strict subset of HTML. – Lightness Races in Orbit Apr 03 '11 at 13:54
  • @Tomalak but i only do that to the stuff inside CODE. on the rest of the comment I'm stripping tags just like SO – Alex Apr 03 '11 at 14:35
  • @Tomalak: and to the stuff inside code i'm doing the html entity thing, so there's no way a html tag can pass, right? ps: here's what id did in the end: http://stackoverflow.com/questions/5527574/simple-bbparser-in-php-that-lets-you-replace-content-outside-tags – Alex Apr 03 '11 at 14:41

PHP

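// Escape only the contents of each <code class="php"> / <pre class="php"> element.
// \2 forces the closing tag to match the opening one, and replacing inside the
// callback avoids touching identical text elsewhere in $html.
$html = preg_replace_callback(
    '#(<(code|pre) class="php">)(.*?)(</\2>)#is',
    function ($m) {
        return $m[1] . htmlentities($m[3], ENT_QUOTES) . $m[4];
    },
    $html
);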

HTML

<code class="php"> &lt;div&gt; a div.. &lt;/div&gt; </code>

<pre class="php">
&lt;div&gt; a div.. &lt;/div&gt;
</pre>

<div> this should be ignored </div>

Have you ever heard of BB code? http://en.wikipedia.org/wiki/BBCode
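
As a rough illustration of the BBCode idea, the sketch below escapes the whole comment first and only afterwards turns a `[code]…[/code]` marker into real elements. The function name `render_comment` and the `[code]` marker are assumptions chosen for this example, not part of any particular BBCode library.

```php
<?php
// Hypothetical helper, written for illustration only.
function render_comment($raw) {
    // Escape everything first, so no user-supplied tag survives.
    $safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // Then convert the [code] markers into real elements.
    return preg_replace(
        '#\[code\](.*?)\[/code\]#is',
        '<pre><code>$1</code></pre>',
        $safe
    );
}

$out = render_comment('[code]<div> a div.. </div>[/code] <script>alert(1)</script>');
echo $out;
```

Because the escaping happens before any markup is generated, the only tags in the output are the ones the function itself emits.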

Dejan Marjanović

OK, I've been playing with this for a while. The result may not be the best or most direct solution (and, frankly, I disagree with your approach entirely if arbitrary users are going to be submitting the input), but it appears to "work". And, most importantly, it doesn't use regexes for parsing XML. :)

Faking the input

<?php

$str = <<<EOF
<code class="php"> <div> a div.. </div> </code>

<pre class="php">
<div> a div.. </div>
</pre>

<div> this should be ignored </div>
EOF;

?>

Code

<?php

// Objects are passed by handle in PHP 5+, so no & references are needed.
function recurse($doc, $parent) {
   if (!$parent->hasChildNodes())
      return;

   foreach ($parent->childNodes as $elm) {

      if ($elm->nodeName == "code" || $elm->nodeName == "pre") {
         $content = '';
         while ($elm->hasChildNodes()) { // `for` breaks the `removeChild`
             $child = $elm->childNodes->item(0);
             $content .= $doc->saveXML($child);
             $elm->removeChild($child);
         }
         $elm->appendChild($doc->createTextNode($content));
      }
      else {
         recurse($doc, $elm);
      }
   }
}

// Load in the DOM (remembering that XML requires one root node)
$doc = new DOMDocument();
$doc->loadXML("<document>" . $str . "</document>");

// Iterate the DOM, finding <code /> and <pre /> tags:
recurse($doc, $doc->documentElement);

// Output the result
foreach ($doc->childNodes->item(0)->childNodes as $node) {
   echo $doc->saveXML($node);
}

?>

Output

<code class="php"> &lt;div&gt; a div.. &lt;/div&gt; </code>

<pre class="php">
&lt;div&gt; a div.. &lt;/div&gt;
</pre>

<div> this should be ignored </div>

Proof

You can see it working here.

Note that it doesn't explicitly call htmlspecialchars; the DOMDocument object handles the escaping itself.
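
A minimal, standalone sketch of that behaviour: markup placed in a text node via `createTextNode` is serialized as entities by `saveXML`. The element and string used here are just sample values.

```php
<?php
$doc = new DOMDocument();

// Build <code> with a single text node whose content looks like markup.
$code = $doc->createElement('code');
$code->appendChild($doc->createTextNode('<div> a div.. </div>'));
$doc->appendChild($code);

// Serializing the node escapes the text content; no htmlspecialchars() call is involved.
$out = $doc->saveXML($code);
echo $out;
```

Passing the node to `saveXML` serializes just that element, without the XML declaration.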

I hope that this helps. :)

Lightness Races in Orbit

This is somewhat related: you do not have to use GeSHi, but I wrote a bit of code in "Advice for implementing simple regex (for bbcode/geshi parsing)" that should help you with the problem.

It can be tweaked to not use GeSHi; it would just take a bit of tinkering. Hope it helps.

Community
Jim