How do I wrap specific text in a link, but skip occurrences that are already inside links?

Example:

<a href="helloworld.com">Lorem ipsum dolor sit amet</a>, consectetur
adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore
magna aliqua. Lorem ipsum dolor sit amet, consectetur <a
href="adipisicing.com">adipisicing</a> elit, sed do eiusmod tempor
incididunt ut labore et dolore <a href="helloworld.com">magna aliqua.
Lorem ipsum</a> dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua.

As you can see, I need to replace "Lorem ipsum" with <a href="somewhere.com">Lorem ipsum</a> in the second sentence, but skip every "Lorem ipsum" that is already inside a link.

Thanks!

RKI
  • Please use the search beforehand. Please also explain how you came to the conclusion to use a regex for that (did you try anything yet?), or if that's an actual constraint. – mario Nov 25 '11 at 07:19
  • I did search, but found nothing that does what I need. I tried it myself, but my expression also replaced text that was already inside links. – RKI Nov 25 '11 at 07:22
  • possible duplicate of [PHP Regular expression to match keyword outside HTML tag `<a>`](http://stackoverflow.com/questions/7798829/php-regular-expression-to-match-keyword-outside-html-tag-a) – mario Nov 25 '11 at 07:23
  • I tried this: `$data = preg_replace( '/(keyword[^\.\,\:\s]*)/iu', '$1', $data );` and `$data = preg_replace( '#(?<![">])(keyword[^\.\,\:\s]*)#iu', '$1', $data );` The second regexp skips the text only if it comes directly after `` – RKI Nov 25 '11 at 07:30
  • @Roman - I would discourage you from using regex on HTML directly and suggest a parser instead. Regex might seem straightforward because your data set looks small and is probably fairly homogeneous. However, time invested in a parser is well spent: you never know how your HTML might change later, and then you would have to implement a parser anyway. – Dimitar Dimitrov Nov 25 '11 at 07:44

1 Answer

Regular expressions are not well suited to dealing with HTML. Any regex-based solution will fail on comments, embedded JavaScript, or malformed HTML.

That said, if you strictly control the structure of your documents, you can try the regex approach. To match every "Lorem ipsum" not inside an a tag, I'd use

Lorem ipsum(?=([^<]*($|<a |<[^/]|</[^a]))*($|(?<=a )))

This pattern uses a lookahead assertion: it matches "Lorem ipsum" only if the next a tag after it is an opening one rather than a closing one, or if no further a tags follow at all. See it in action at RegExr.
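
To make the behaviour concrete, here is a minimal sketch that applies the pattern above to a condensed, single-line version of the example from the question. It is written in Python purely for illustration; the pattern is copied unchanged, and the shortened sample text and the `somewhere.com` replacement link are the only other inputs.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"Lorem ipsum(?=([^<]*($|<a |<[^/]|</[^a]))*($|(?<=a )))")

# Condensed, single-line version of the example from the question.
html = (
    '<a href="helloworld.com">Lorem ipsum dolor sit amet</a>, consectetur. '
    'Lorem ipsum dolor sit amet, consectetur '
    '<a href="adipisicing.com">adipisicing</a> elit, sed do '
    '<a href="helloworld.com">magna aliqua. Lorem ipsum</a> dolor sit amet.'
)

# \g<0> re-inserts the matched text inside the new link.
result = PATTERN.sub(r'<a href="somewhere.com">\g<0></a>', html)
print(result)
```

Only the occurrence between the first and second link is wrapped here. Note that the pattern looks for `<a` followed by a literal space, so an opening tag that is wrapped onto a new line right after `<a` would not be recognised.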

As you can see, it is probably better to use an HTML parser. =)
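
For comparison, a parser-based version might look like the sketch below. It uses the `html.parser` module from Python's standard library; the `LinkWrapper` class and the `link_phrase` helper are hypothetical names introduced only for this illustration. It is a sketch under assumptions, not a complete solution: the phrase must sit inside a single text node, and `href` is inserted without escaping.

```python
from html.parser import HTMLParser

# Elements whose text content must be left untouched.
PROTECTED = {"a", "script", "style"}


class LinkWrapper(HTMLParser):
    """Re-emit HTML, linking a phrase wherever it is not already protected."""

    def __init__(self, phrase, href):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.phrase = phrase
        self.link = '<a href="{}">{}</a>'.format(href, phrase)
        self.protected = 0  # number of protected elements currently open
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in PROTECTED:
            self.protected += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags open nothing, so just copy them through.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in PROTECTED and self.protected > 0:
            self.protected -= 1
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Replace only in text that is outside every protected element.
        if self.protected == 0:
            data = data.replace(self.phrase, self.link)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        # Comments are copied verbatim, so text inside them is never linked.
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))


def link_phrase(html, phrase, href):
    """Hypothetical helper: run the parser and join the re-emitted pieces."""
    parser = LinkWrapper(phrase, href)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<p><a href="helloworld.com">Lorem ipsum dolor</a> sit. Lorem ipsum amet '
    '<!-- Lorem ipsum --> <a href="x.com">magna. Lorem ipsum</a> end.</p>'
)
linked = link_phrase(sample, "Lorem ipsum", "somewhere.com")
print(linked)
```

Because the parser reports comments, tags, and text as separate events, the phrase inside the comment and inside the existing links is left alone without any special-casing in a pattern.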

Jens