
I'm trying to find every match of a word within a content block while ignoring anything that is inside tags, for use with preg_replace_callback().

For example:

test
<a href="test.com">test title</a>
test

In this case, I want the first line and the third line to match, but NOT the occurrence in the URL, nor the one in the title between the <a> and </a> tags.

I've got a regex that I feel like is close:

#(?!<.*?)(\btest\b)(?![^<>]*?>)#si

(and this will not match the url part)

But how do I modify the regex to also exclude the "test" between <a> and </a>?
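
To make the problem concrete, here is a minimal sketch that runs the pattern above against the sample input and lists where it matches. The variable names are only illustrative; the pattern and input are the ones from the question.

```php
<?php
// The sample content from the question.
$content = "test\n<a href=\"test.com\">test title</a>\ntest";

// The pattern from the question, unchanged.
$pattern = '#(?!<.*?)(\btest\b)(?![^<>]*?>)#si';

// PREG_OFFSET_CAPTURE returns each match as [text, byte offset].
preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the full matches.
$offsets = array_column($matches[0], 1);

print_r($offsets); // offsets 0, 24 and 39
```

Offset 24 is the "test" in the link text, which is the match that still needs to be excluded. The one inside the href attribute is already skipped by the trailing lookahead.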

Ben
  • `and the fourth line to match` Erm, you only have three lines in your input? – CertainPerformance Oct 20 '18 at 21:52
  • Do you have to account for nested tags as well? Eg `testtesttest`, or self-closing tags? Sounds like a job for something that's *not* a regular expression (HTML and regex generally do not work well together) – CertainPerformance Oct 20 '18 at 21:55
  • HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. See: http://php.net/manual/en/class.domdocument.php – Toto Oct 21 '18 at 08:41
  • It doesn't use nested tags, and unfortunately due to the application I have to use regex, but I appreciate the thoughtful question and suggestion. – Ben Oct 21 '18 at 12:32

2 Answers

If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]

Jake

I ended up solving it myself. This regex pattern will do what I wanted:

#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
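
As a rough sketch of how this pattern might be used with preg_replace_callback(), the example below wraps each surviving match in square brackets. The bracket wrapping and the variable names are assumptions for illustration only; the real callback would do whatever the application needs.

```php
<?php
// The sample content from the question.
$content = "test\n<a href=\"test.com\">test title</a>\ntest";

// The pattern from this answer, unchanged.
$pattern = '#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si';

// Hypothetical callback: mark each match so the result is easy to inspect.
$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        return '[' . $m[1] . ']';
    },
    $content
);

echo $result;
// [test]
// <a href="test.com">test title</a>
// [test]
```

Two caveats seem worth noting. The leading lookahead can never fail, because the next character at that position is always the "t" of "test", so the trailing lookahead does all the work. That trailing lookahead skips any "test" that is followed by a closing </a> with no other "<" in between, which covers both the href value and the link text here, but it would not skip text inside a nested tag within the link or inside attributes of other tags.
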
Ben