Regex to not match inside html anchor tag

Question

I have a requirement to not match a specific word when it occurs inside an anchor tag. Anchor tags can have other HTML tags nested inside them.

For Example:

    <a title="Test" href="http://www.google.com/"><span style="color: blue;">Test</span></a><p>Test - MANUALLY<br /><br /><a href="http://www.google.com">Google</a>&nbsp;</p><p> Resolving as duplicate of Test</p><p>Test  test</p>

Here every "Test" gets matched. I only want the occurrences of "Test" that are neither inside an anchor tag nor part of an anchor tag's attributes.

The regex I used was:

    (?!<a[^>]*>)(Test)(?![^<]*<\/a>)/gi

You would need some kind of SAX parser to know when an open anchor tag starts. Start by examining text content for what you want to find. When you get an open anchor tag, ignore text content that pours in. Wait for a closing anchor, then resume search on text content that comes in. — , May 05 '17 at 17:57
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 — melpomene, May 05 '17 at 19:10

score 2 · Answer 1 · answered May 05 '17 at 19:06

I'm not sure this fully meets your needs, but in the pattern below the first alternative consumes each whole anchor element, so the second capturing group only contains matches that fall outside anchor tags.

    (<a.*?<\/a>)|(test)/gi

https://regex101.com/r/rTLifk/1
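To make the capturing-group idea concrete, here is a minimal sketch of how the pattern might be used from JavaScript. It assumes an environment with `String.prototype.matchAll`; the helper name `findOutsideAnchors` is made up for illustration. The sample string is the one from the question.

```javascript
// Sample input copied from the question.
const html =
  '<a title="Test" href="http://www.google.com/"><span style="color: blue;">Test</span></a>' +
  '<p>Test - MANUALLY<br /><br /><a href="http://www.google.com">Google</a>&nbsp;</p>' +
  '<p> Resolving as duplicate of Test</p><p>Test  test</p>';

// Hypothetical helper (not part of any library): returns the matches of the
// second capturing group, i.e. occurrences that were not swallowed by an anchor.
function findOutsideAnchors(input) {
  const pattern = /(<a.*?<\/a>)|(test)/gi;
  const hits = [];
  for (const m of input.matchAll(pattern)) {
    // When group 1 is set, the match is a whole <a>...</a> element: skip it.
    // When group 2 is set, the word was found outside any anchor.
    if (m[2] !== undefined) {
      hits.push({ text: m[2], index: m.index });
    }
  }
  return hits;
}

const outsideHits = findOutsideAnchors(html);
console.log(outsideHits.map((h) => h.text)); // [ 'Test', 'Test', 'Test', 'test' ]
```

Two caveats with this sketch: `<a` also matches tags such as `<abbr`, so `<a\b` may be a safer start, and the word would still be matched inside attributes of non-anchor tags.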

However, I would highly recommend utilizing an XML parser or XPath.
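As a rough illustration of the parser-style approach, the sketch below walks the markup token by token and only searches text while no anchor is open. It is a hand-rolled tokenizer, not a real XML or HTML parser, and it assumes attribute values never contain `>`. The names `findWordOutsideAnchors` and `escapeRegExp` are hypothetical helpers.

```javascript
// Escape regex metacharacters so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: find `word` (case-insensitive) in text that is not
// inside an <a>...</a> element. Assumes no '>' inside attribute values.
function findWordOutsideAnchors(input, word) {
  // A token is either a tag (group 1 = tag name) or a run of text.
  const tokenPattern = /<\/?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>|[^<]+/g;
  const wordPattern = new RegExp(escapeRegExp(word), "gi");
  const hits = [];
  let anchorDepth = 0; // number of currently open <a> elements

  for (const token of input.matchAll(tokenPattern)) {
    const raw = token[0];
    if (token[1] !== undefined) {
      // Tag token: only anchors change state; attributes are never searched.
      if (token[1].toLowerCase() === "a") {
        if (raw.startsWith("</")) {
          anchorDepth = Math.max(0, anchorDepth - 1);
        } else if (!raw.endsWith("/>")) {
          anchorDepth += 1;
        }
      }
      continue;
    }
    if (anchorDepth > 0) continue; // text inside an anchor: ignore
    for (const m of raw.matchAll(wordPattern)) {
      hits.push({ text: m[0], index: token.index + m.index });
    }
  }
  return hits;
}

const sample =
  '<a title="Test" href="http://www.google.com/"><span style="color: blue;">Test</span></a>' +
  '<p>Test - MANUALLY<br /><br /><a href="http://www.google.com">Google</a>&nbsp;</p>' +
  '<p> Resolving as duplicate of Test</p><p>Test  test</p>';

const parsedHits = findWordOutsideAnchors(sample, "Test");
console.log(parsedHits.length); // 4
```

Unlike the single-regex version, this never searches inside tag attributes at all. For real-world markup a proper parser is still the more robust choice.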
