
I'm replacing some text using a regex that I've found here.

$items = array(
  ':)'   => 'smile',
  ':('   => 'sad',
  '=))'  => 'laugh',
  ':p'   => 'tongue',      
); 

foreach($items as $key => $class)
  $regex[] = preg_quote($key, '#');

$regex = '#(?<!\w)('.implode('|', $regex).')(?!\w)#';

$string = preg_replace_callback($regex, function($matches) use($items){

  if(isset($items[$matches[0]])) 
    return '<span class="'.$items[$matches[0]].'">'.$matches[0].'</span>';

  return $matches[0];

}, $string);

It works, but how can I ignore matches within HTML tag definitions (such as inside tag attributes)?

For example:

$string = 'Hello :) <a title="Hello :)"> Bye :( </a>';

=> The second :) should be ignored.

Alex
  • The simple answer, just like every time when HTML parsing is involved, is: Don't use regex. – Tomalak May 22 '12 at 16:58
  • but there aren't any good HTML parsers for PHP :( There's the DOM extension, but let's face it, it sucks.. – Alex May 22 '12 at 17:01
  • @Alex wait. wut??? You think DOMDocument sucks, but you are using regex? – PeeHaa May 22 '12 at 17:03
  • @Alex You decided to use Regex to solve problem. Now you have 2 problems. – Robik May 22 '12 at 17:09
  • [PHP DOMDocument](http://php.net/manual/en/class.domdocument.php) can do what you require. [Search](http://stackoverflow.com/search?q=dom+parser+%5Bphp%5D&submit=search) StackOverflow for related questions, or read the documentation. Edit: You said DOMDocument sucks, but are (attempting) to use RegEx to solve your problem. Sorry, I can't help you further. A developer is only as good as the tools he (or she) utilizes and understands. – PenguinCoder May 22 '12 at 17:09

2 Answers


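Pre-filter your input string first: remove or mask any smileys that sit inside HTML tags. A pattern along these lines, built from the array of escaped smileys before `$regex` is overwritten with the final pattern string, matches a tag that contains one:

$tag_regex = '#<[^>]+('.implode('|', $regex).')[^>]+>#';

Then run your code above on the filtered string.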
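A related way to keep tags out of the replacement, sketched here as an illustration rather than as the answerer's exact method, is to split the string on tags and run the smiley regex only on the text pieces in between. The function name `replace_smileys_outside_tags` is hypothetical, and the sketch assumes PHP 7+ and that no attribute value contains a literal `>`; comments, `<script>` blocks and other edge cases are not handled.

```php
<?php
// Hypothetical helper: replaces smileys in text, but never inside <...> tags.
function replace_smileys_outside_tags(string $html, array $items): string
{
    // Escape each smiley for use inside a '#'-delimited pattern.
    $alternatives = array_map(function ($key) {
        return preg_quote($key, '#');
    }, array_keys($items));
    $regex = '#(?<!\w)(' . implode('|', $alternatives) . ')(?!\w)#';

    // Split into text and tags. With PREG_SPLIT_DELIM_CAPTURE the captured
    // tags end up at the odd indexes, the text between them at the even ones.
    $parts = preg_split('#(<[^>]*>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a tag: leave it untouched
        }
        $parts[$i] = preg_replace_callback($regex, function ($m) use ($items) {
            return '<span class="' . $items[$m[0]] . '">' . $m[0] . '</span>';
        }, $part);
    }

    return implode('', $parts);
}

$items = array(':)' => 'smile', ':(' => 'sad', '=))' => 'laugh', ':p' => 'tongue');
echo replace_smileys_outside_tags('Hello :) <a title="Hello :)"> Bye :( </a>', $items);
```

This is still regex-based tag detection, so it inherits the usual fragility of that approach; it is only meant for input whose markup you control.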

flowfree

Here's a DOMDocument-based implementation that does a by-the-book string replacement for your HTML:

$string = 'Hello :) <a title="Hello :)"> Bye :( </a>';

$items = array(
  ':)'   => 'smile',
  ':('   => 'sad',
  '=))'  => 'laugh',
  ':p'   => 'tongue',      
); 

$regex = array();
foreach($items as $key => $class) $regex[] = preg_quote($key, '#');

$regex = '#(?<!\w)('.implode('|', $regex).')(?!\w)#';

$doc = new DOMDocument();
$doc->loadHTML($string);

$xp = new DOMXPath($doc);

$text_nodes = $xp->query('//text()');

foreach ($text_nodes as $text_node)
{
  $parent  = $text_node->parentNode;
  $context = $text_node->nextSibling;
  $text    = $text_node->nodeValue;
  $matches = array();
  $offset  = 0;

  $parent->removeChild($text_node);

  while ( preg_match($regex, $text, $matches, PREG_OFFSET_CAPTURE, $offset) > 0 )
  {
    $match  = $matches[0];
    $smiley = $match[0];
    $pos    = $match[1];
    $prefix = substr($text, $offset, $pos - $offset);
    $offset = $pos + strlen($smiley);

    $span = $doc->createElement('span', $smiley);
    $span->setAttribute('class', $items[$smiley]);

    $parent->insertBefore( $doc->createTextNode($prefix), $context );
    $parent->insertBefore( $span, $context );
  }

  $suffix = substr($text, $offset);
  $parent->insertBefore( $doc->createTextNode($suffix), $context );
}

$body = $doc->getElementsByTagName('body')->item(0);
$html = $doc->saveHTML( $body );

Wrap it in a function and you're good to go. It may be more lines of code than regex, but it's not an ugly, bug-ridden maintenance nightmare (like any regex-based solution would be).
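One way the "wrap it in a function" step might look is sketched below. The function name `wrap_smileys`, its signature, and the error handling are assumptions made for illustration, not part of the answer; the sketch assumes PHP 7+ with the DOM extension, and it returns the inner HTML of `<body>` (libxml may add wrapper elements such as `<p>` around bare text).

```php
<?php
// Hypothetical wrapper around the DOM-based replacement.
function wrap_smileys(string $html, array $items): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects empty input
    }

    $alternatives = array();
    foreach ($items as $key => $class) {
        $alternatives[] = preg_quote($key, '#');
    }
    $regex = '#(?<!\w)(' . implode('|', $alternatives) . ')(?!\w)#';

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // nothing parseable as body content
    }

    $xp = new DOMXPath($doc);
    foreach ($xp->query('//body//text()') as $text_node) {
        $text   = $text_node->nodeValue;
        $parent = $text_node->parentNode;
        $offset = 0;

        // Insert "prefix text + <span>" pairs in front of the original node.
        while (preg_match($regex, $text, $m, PREG_OFFSET_CAPTURE, $offset)) {
            list($smiley, $pos) = $m[0];
            $prefix = substr($text, $offset, $pos - $offset);
            $offset = $pos + strlen($smiley);

            $span = $doc->createElement('span');
            $span->appendChild($doc->createTextNode($smiley));
            $span->setAttribute('class', $items[$smiley]);

            $parent->insertBefore($doc->createTextNode($prefix), $text_node);
            $parent->insertBefore($span, $text_node);
        }

        if ($offset > 0) {
            // At least one smiley was found: swap the original node for the rest.
            $parent->insertBefore($doc->createTextNode(substr($text, $offset)), $text_node);
            $parent->removeChild($text_node);
        }
    }

    // Serialize the children of <body>, not <body> itself.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$items = array(':)' => 'smile', ':(' => 'sad', '=))' => 'laugh', ':p' => 'tongue');
echo wrap_smileys('Hello :) <a title="Hello :)"> Bye :( </a>', $items);
```

Text nodes without any smiley are left in place, and attribute values are never visited because only text nodes are queried. Character encoding is not addressed here; `loadHTML()` does not assume UTF-8 by default, so non-ASCII input would need extra care.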

Tomalak
  • thanks, I'll go with DOMDocument.. – Alex May 22 '12 at 20:18
  • @Alex I've not tested the code above. Please fix errors in my answer if you uncover some. – Tomalak May 22 '12 at 20:20
  • I will, but right now I'm trying to figure out how to do this with phpQuery which is a jquery-like interface for DOMdocument. – Alex May 22 '12 at 20:37
  • @Alex That's not a bad idea. I did not think of phpQuery. – Tomalak May 22 '12 at 20:41
  • That answer does not deal with the case that the smiley might be interrupted with a comment or tag, see http://stackoverflow.com/questions/8193327/ignore-html-tags-in-preg-replace - @Alex: For phpQuery, it's based on DOMDocument, too, so you can combine the two. – hakre May 22 '12 at 22:23
  • @hakre The answer probably does not deal with a couple of other corner cases, either. \*shrugs\* – Tomalak May 22 '12 at 22:24
  • Nah, just saying and a little cross linking. My TextRange class does not have yet some direct regex access, too (which would be cool). – hakre May 22 '12 at 22:25
  • I ended up using FluentDOM, it's a little more flexible than phpQuery. Anyway your code works, the only mistake is @ `$span = $doc->createElement('span', $items[$smiley]);` the content of the span should be `$smiley` – Alex May 23 '12 at 12:00
  • @Alex Thanks for that, I corrected it. It would be great if you could share the code you ended up using, I'd like to have a look (and potentially, others might benefit in the future). – Tomalak May 23 '12 at 12:17
  • well it's pretty much the same as your code: http://codepad.org/SADkI0sS. But I had to modify the each() function of FluentDOM to pass the document object too, as I haven't figured out how to get it from the callback function yet... – Alex May 23 '12 at 18:20
  • @Alex: Every `DOMNode` has an `ownerDocument` property. You could use that. – Tomalak May 23 '12 at 19:12
  • ah yes, it was the ownerDocument property of `$el`, thanks :) – Alex May 23 '12 at 19:17
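
The `ownerDocument` tip from the comments can be illustrated with a small sketch. The helper name `smiley_span` is hypothetical, and the sketch assumes PHP 7+; it shows a callback-style function that receives only a node and still manages to create new elements, because it fetches the document from the node itself.

```php
<?php
// Hypothetical helper: builds a <span> for a smiley using only the node it is given.
function smiley_span(DOMNode $el, string $smiley, string $class): DOMElement
{
    // Every node created by a DOMDocument keeps a reference to it.
    // (For the DOMDocument object itself, ownerDocument is null.)
    $doc = $el->ownerDocument;

    $span = $doc->createElement('span');
    $span->appendChild($doc->createTextNode($smiley));
    $span->setAttribute('class', $class);
    return $span;
}

$doc = new DOMDocument();
$p = $doc->createElement('p');
$doc->appendChild($p);

// No $doc is passed in; the helper recovers it through $p->ownerDocument.
$p->appendChild(smiley_span($p, ':)', 'smile'));
echo $doc->saveHTML($p);
```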