The following regular expression matches the domains below just fine, but I don't want it to match a domain that is already inside an anchor tag (the last example). Note that the matching will be applied to sentences of ordinary text.

((?:http|https)://)?[.0-9a-z-]+\.[a-z]{2,6}(?::[0-9]{1,5}+)?(?:/[!$'()*+,._a-z-]++){0,9}(?:/[!$'()*+,._a-z-]*)?(?:\?[!$&'()*+,.=_a-z-]*)?

It matches all of these in a sentence or paragraph:

www.domain.com
domain.com
this.is.a.special.url.domain.com/hello 
http://domain.com
http://www.domain.com
http://www.domain.com/
http://www.domain.com/index.html
http://www.domain.com/index.html?source=library

But how do I change the regex so that it does not match a domain that is already in an anchor tag?

<a href="http://www.usertesting.com">hello</a>
Andy Lester
PeppyHeppy
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Sep 06 '13 at 01:30
  • @AndyLester, thanks, but I am not parsing html, I am skipping html and only looking for non-html urls. – PeppyHeppy Sep 06 '13 at 04:04
  • I understand that, and identifying which part of the file is markup and which is text is, indeed, parsing HTML. – Andy Lester Sep 06 '13 at 04:58
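
To make the parser suggestion in the comments concrete, here is a minimal sketch of the idea: let an HTML parser decide what is markup, and run the URL regex only on text that is not inside an `<a>` element. The thread is about PHP; this uses Python's standard `html.parser` purely for illustration. The class name `UrlsOutsideAnchors` and the shortened `URL` pattern are assumptions made for this example, not something from the question.

```python
import re
from html.parser import HTMLParser

# Shortened, illustrative URL pattern (an assumption, not the question's full regex).
URL = re.compile(r"(?:https?://)?[.0-9a-z-]+\.[a-z]{2,6}(?:/[!$'()*+,._a-z-]*)*")


class UrlsOutsideAnchors(HTMLParser):
    """Collects URL-like strings from text nodes that are not inside <a>...</a>."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while we are inside an anchor element
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Attribute values such as href never reach handle_data,
        # so only visible text outside anchors is searched.
        if self.anchor_depth == 0:
            self.urls.extend(m.group(0) for m in URL.finditer(data))


parser = UrlsOutsideAnchors()
parser.feed('Visit domain.com or <a href="http://www.usertesting.com">www.usertesting.com</a> today.')
parser.close()
print(parser.urls)  # ['domain.com']
```

Both the `href` value and the link text are skipped here, because the parser, not the regex, decides what counts as plain text.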

1 Answer

You can add a negative lookbehind to exclude matches that directly follow href=" or href=', like this:

(?<!href=["'])((?:http|https)://)?[.0-9a-z-]+\.[a-z]{2,6}(?::[0-9]{1,5}+)?(?:/[!$'()*+,._a-z-]++){0,9}(?:/[!$'()*+,._a-z-]*)?(?:\?[!$&'()*+,.=_a-z-]*)?
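
A quick way to check what the lookbehind does is to run it on a sample string. The sketch below uses Python rather than PHP, and drops the possessive quantifiers so it also runs on Python versions before 3.11. One thing it suggests: the lookbehind only rejects a match that starts right after the quote, so the engine can still begin a match later inside the same href value. The extra `BOUNDARY` guard is my own assumption, not part of the answer above; it refuses to start a match in the middle of a URL-like run. The helper `find_urls` is likewise just for this example.

```python
import re

# The question's pattern without possessive quantifiers (a simplification for this sketch).
URL = r"""((?:http|https)://)?[.0-9a-z-]+\.[a-z]{2,6}(?::[0-9]{1,5})?(?:/[!$'()*+,._a-z-]+){0,9}(?:/[!$'()*+,._a-z-]*)?(?:\?[!$&'()*+,.=_a-z-]*)?"""

# Lookbehind from the answer: do not start a match right after href=" or href='.
LOOKBEHIND = r"""(?<!href=["'])"""

# Assumed extra guard: do not start a match in the middle of a URL-like token.
BOUNDARY = r"""(?<![0-9a-z./:-])"""


def find_urls(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


text = 'Visit domain.com or <a href="http://www.usertesting.com">hello</a>'

# Lookbehind alone: the match is retried further into the href value.
print(find_urls(LOOKBEHIND + URL, text))             # ['domain.com', 'www.usertesting.com']

# Lookbehind plus boundary guard: nothing inside the href value matches.
print(find_urls(LOOKBEHIND + BOUNDARY + URL, text))  # ['domain.com']
```

Neither variant looks at the link text between `<a ...>` and `</a>`, so a URL used as the visible text of a link would still be matched.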
justhalf