Ignore matches from HTML tag definitions

Question

I'm replacing some text using a regex that I've found here.

$items = array(
  ':)'   => 'smile',
  ':('   => 'sad',
  '=))'  => 'laugh',
  ':p'   => 'tongue',      
); 

foreach($items as $key => $class)
  $regex[] = preg_quote($key, '#');

// match a smiley only when it does not touch a word character on either side
$regex = '#(?<!\w)('.implode('|', $regex).')(?!\w)#';

$string = preg_replace_callback($regex, function($matches) use($items){

  if(isset($items[$matches[0]])) 
    return '<span class="'.$items[$matches[0]].'">'.$matches[0].'</span>';

  return $matches[0];

}, $string);

It works, but how can I ignore matches inside HTML tag definitions (for example, inside tag attributes)?

For example:

$string = 'Hello :) <a title="Hello :)"> Bye :( </a>';

=> The second :) should be ignored.

The simple answer, just like every time when HTML parsing is involved, is: Don't use regex. — Tomalak, May 22 '12 at 16:58
but there aren't any good HTML parsers for PHP :( There's the DOM extension, but let's face it, it sucks.. — Alex, May 22 '12 at 17:01
@Alex wait. wut??? You think DOMDocument? sucks, but you are using regex? — PeeHaa, May 22 '12 at 17:03
@Alex You decided to use Regex to solve problem. Now you have 2 problems. — Robik, May 22 '12 at 17:09
[PHP DOMDocument](http://php.net/manual/en/class.domdocument.php) can do what you require. [Search](http://stackoverflow.com/search?q=dom+parser+%5Bphp%5D&submit=search) StackOverflow for related questions, or read the documentation. Edit: You said DOMDocument sucks, but are (attempting) to use RegEx to solve your problem. Sorry, I can't help you further. A developer is only as good as the tools he (or she) utilizes and understands. — PenguinCoder, May 22 '12 at 17:09

score 1 · Answer 1 · answered May 22 '12 at 17:28

Pre-filter your input string first. Clean up any smileys within HTML tags:

$regex = '#<[^>]+('.implode('|', $regex).')[^>]+>#';

and run your code above.
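
One way to get a similar effect without modifying the tags at all is to split the string on tags and run the replacement only on the chunks in between. This is a variation on the pre-filter idea, not the exact method described above, and it is only a rough sketch: the function name `wrap_smileys_outside_tags` is made up for illustration, and the `<[^>]*>` pattern assumes no attribute value contains a literal `>`.

```php
<?php
// Hypothetical helper (not part of the original answer): wrap smileys in
// <span> elements, but only in the text between tags.
function wrap_smileys_outside_tags($string, array $items)
{
    $quoted = array();
    foreach ($items as $key => $class) {
        $quoted[] = preg_quote($key, '#');
    }
    // Smiley must not touch a word character on either side.
    $regex = '#(?<!\w)(' . implode('|', $quoted) . ')(?!\w)#';

    // Split into alternating text and tag chunks; the tags are kept
    // in the result because of PREG_SPLIT_DELIM_CAPTURE.
    $chunks = preg_split('#(<[^>]*>)#', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($chunks as $i => $chunk) {
        // With a single capture group, odd indexes hold the tags: skip them.
        if ($i % 2 === 1) {
            continue;
        }
        $chunks[$i] = preg_replace_callback($regex, function ($m) use ($items) {
            return '<span class="' . $items[$m[0]] . '">' . $m[0] . '</span>';
        }, $chunk);
    }

    return implode('', $chunks);
}

$items = array(':)' => 'smile', ':(' => 'sad', '=))' => 'laugh', ':p' => 'tongue');
echo wrap_smileys_outside_tags('Hello :) <a title="Hello :)"> Bye :( </a>', $items);
```

The smiley inside the `title` attribute sits in a tag chunk, so it is never passed to the replacement callback. Comments, `<script>` blocks and malformed markup are not handled, which is the usual weakness of regex-based HTML handling.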

flowfree


Tomalak · Accepted Answer · 2012-05-23T12:16:18.450

Here's a DOMDocument-based implementation that does a by-the-book string replacement for your HTML:

$string = 'Hello :) <a title="Hello :)"> Bye :( </a>';

$items = array(
  ':)'   => 'smile',
  ':('   => 'sad',
  '=))'  => 'laugh',
  ':p'   => 'tongue',      
); 

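$regex = array();
foreach($items as $key => $class) $regex[] = preg_quote($key, '#');

// match a smiley only when it does not touch a word character on either side
$regex = '#(?<!\w)('.implode('|', $regex).')(?!\w)#';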

$doc = new DOMDocument();
$doc->loadHTML($string);

$xp = new DOMXPath($doc);

$text_nodes = $xp->query('//text()');

foreach ($text_nodes as $text_node)
{
  $parent  = $text_node->parentNode;
  $context = $text_node->nextSibling;
  $text    = $text_node->nodeValue;
  $matches = array();
  $offset  = 0;

  $parent->removeChild($text_node);

  while ( preg_match($regex, $text, $matches, PREG_OFFSET_CAPTURE, $offset) > 0 )
  {
    $match  = $matches[0];
    $smiley = $match[0];
    $pos    = $match[1];
    $prefix = substr($text, $offset, $pos - $offset);
    $offset = $pos + strlen($smiley);

    $span = $doc->createElement('span', $smiley);
    $span->setAttribute('class', $items[$smiley]);

    $parent->insertBefore( $doc->createTextNode($prefix), $context );
    $parent->insertBefore( $span, $context );
  }

  $suffix = substr($text, $offset);
  $parent->insertBefore( $doc->createTextNode($suffix), $context );
}

// loadHTML() wraps the fragment in <html><body>; serialize only the body
$body = $doc->getElementsByTagName('body');
$html = $doc->saveHTML( $body->item(0) );

Wrap it in a function and you're good to go. It may be more lines of code than regex, but it's not an ugly, bug-ridden maintenance nightmare (like any regex-based solution would be).
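
Wrapped in a function, the code above might look roughly like this. It is a sketch, not a tested drop-in: the name `replace_smileys` is made up, the lookbehind is written as `(?<!\w)`, and the children of `<body>` are serialized one by one so the wrapper elements that `loadHTML()` adds are left out of the result.

```php
<?php
// Hypothetical wrapper name; the body follows the DOMDocument code above.
function replace_smileys($html, array $items)
{
    $quoted = array();
    foreach ($items as $key => $class) {
        $quoted[] = preg_quote($key, '#');
    }
    $regex = '#(?<!\w)(' . implode('|', $quoted) . ')(?!\w)#';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xp = new DOMXPath($doc);

    // Only text nodes are visited, so attribute values are never touched.
    foreach ($xp->query('//text()') as $text_node) {
        $parent  = $text_node->parentNode;
        $context = $text_node->nextSibling;
        $text    = $text_node->nodeValue;
        $offset  = 0;

        $parent->removeChild($text_node);

        while (preg_match($regex, $text, $matches, PREG_OFFSET_CAPTURE, $offset) > 0) {
            list($smiley, $pos) = $matches[0];
            $prefix = substr($text, $offset, $pos - $offset);
            $offset = $pos + strlen($smiley);

            $span = $doc->createElement('span', $smiley);
            $span->setAttribute('class', $items[$smiley]);

            // Re-insert the text before the smiley, then the new <span>.
            $parent->insertBefore($doc->createTextNode($prefix), $context);
            $parent->insertBefore($span, $context);
        }

        // Whatever text is left after the last smiley.
        $parent->insertBefore($doc->createTextNode(substr($text, $offset)), $context);
    }

    // Serialize the children of <body>, dropping the <html><body> wrapper.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$items = array(':)' => 'smile', ':(' => 'sad', '=))' => 'laugh', ':p' => 'tongue');
echo replace_smileys('Hello :) <a title="Hello :)"> Bye :( </a>', $items);
```

Note that `loadHTML()` may normalize the markup (for instance by wrapping loose text in a `<p>`), so the output is not guaranteed to be byte-for-byte the input plus spans.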

edited May 23 '12 at 12:16

answered May 22 '12 at 20:10

Tomalak


thanks, I'll go with DOMDocument.. – Alex May 22 '12 at 20:18
@Alex I've not tested the code above. Please fix errors in my answer if you uncover some. – Tomalak May 22 '12 at 20:20
I will, but right now I'm trying to figure out how to do this with phpQuery which is a jquery-like interface for DOMdocument. – Alex May 22 '12 at 20:37
@Alex That's not a bad idea. I did not think of phpQuery. – Tomalak May 22 '12 at 20:41
That answer does not deal with the case that the smiley might be interrupted with a comment or tag, see http://stackoverflow.com/questions/8193327/ignore-html-tags-in-preg-replace - @Alex: For phpQuery, it's based on DOMDocument, too, so you can combine the two. – hakre May 22 '12 at 22:23
@hakre The answer probably does not deal with a couple of other corner cases, either. \*shrugs\* – Tomalak May 22 '12 at 22:24
Nah, just saying and a little cross linking. My TextRange class does not have yet some direct regex access, too (which would be cool). – hakre May 22 '12 at 22:25
I ended up using FluentDOM, it's a little more flexible than phpQuery. Anyway your code works, the only mistake is @ `$span = $doc->createElement('span', $items[$smiley]);` the content of the span should be `$smiley` – Alex May 23 '12 at 12:00
@Alex Thanks for that, I corrected it. It would be great if you could share the code you ended up using, I'd like to have a look (and potentially, others might benefit in the future). – Tomalak May 23 '12 at 12:17
well it's pretty much the same as your code: http://codepad.org/SADkI0sS. But I had to modify the each() function of fluentDom to pass the documentobject too, as I haven't figured it out how to get it from the callback function yet... – Alex May 23 '12 at 18:20
@Alex: Every `DOMNode` has an `ownerDocument` property. You could use that. – Tomalak May 23 '12 at 19:12
ah yes, it was the ownerDocument property of `$el`, thanks :) – Alex May 23 '12 at 19:17
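
To illustrate the `ownerDocument` point from the last two comments: a callback that only receives a node can still create new elements, because the node carries a reference to its document. A minimal sketch follows; the helper name `wrap_text_node` is made up, and this is not the FluentDOM code from the codepad link.

```php
<?php
// Hypothetical helper: wrap an existing text node in a <span>, using only
// the node itself to reach the document it belongs to.
function wrap_text_node(DOMNode $text_node, $class)
{
    $doc  = $text_node->ownerDocument;   // the DOMDocument that owns this node
    $span = $doc->createElement('span');
    $span->setAttribute('class', $class);

    // Put the span where the text node was, then move the text node into it.
    $text_node->parentNode->replaceChild($span, $text_node);
    $span->appendChild($text_node);

    return $span;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>:)</p>');
$p = $doc->getElementsByTagName('p')->item(0);

wrap_text_node($p->firstChild, 'smile');
echo $doc->saveHTML($p);
```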
