Fortunately, Java supports both lookbehind and lookahead (in contrast, the language I spend most of my time in, JavaScript, supported only lookahead at the time of writing).
So the pattern you're looking for is:
(?<!<p)/(?!p>)
This pattern will match any slash that's neither preceded by a <p nor followed by a p>. Therefore it excludes <p/> as well as </p>.
The lookahead and lookbehind assertions (often called "zero-width" assertions) are not included in the match itself, which sounds like what you want. Each one asserts that the thing you are trying to match is preceded by (lookbehind) or followed by (lookahead) a sub-expression. In this case we're using negative assertions (not preceded by / not followed by).
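To see the zero-width behaviour in practice, here is a minimal sketch that runs the pattern through java.util.regex. The class name SlashMatchDemo and the helper matchPositions are made up for illustration; only Pattern and Matcher come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashMatchDemo {
    // A slash that is not preceded by "<p" and not followed by "p>".
    static final Pattern SLASH = Pattern.compile("(?<!<p)/(?!p>)");

    // Hypothetical helper: returns the start index of every matched slash.
    static List<Integer> matchPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = SLASH.matcher(input);
        while (m.find()) {
            // m.group() is always just "/": the assertions consume no characters.
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(matchPositions("<p/>")); // []
        System.out.println(matchPositions("</p>")); // []
        System.out.println(matchPositions("a/b"));  // [1]
    }
}
```

Note the doubled backslashes you would need for any escapes once the pattern lives inside a Java string literal; this particular pattern happens to have none.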
Parsing HTML with regex is a tricky business. As one answer pointed out, HTML is context-free, and therefore cannot be completely parsed by a regular expression, leaving open the possibility of HTML that will confound the match. Let's not even get started on ill-formed HTML.
I would consider the following common variation on an empty tag, though:
<p />
To handle this, I would add some whitespace to the match:
(?<!<p\s*)/(?!p>)
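One caveat: as far as I know, older Java versions reject an unbounded quantifier such as \s* inside a lookbehind with a PatternSyntaxException, because the lookbehind needs an obvious maximum length. A bounded quantifier is a safer sketch. The limit of 10 whitespace characters below is an arbitrary assumption of mine, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class EmptyTagWhitespaceDemo {
    // Assumption: at most 10 whitespace characters between "<p" and "/".
    // The bounded {0,10} keeps the lookbehind at a finite maximum length.
    static final Pattern SLASH = Pattern.compile("(?<!<p\\s{0,10})/(?!p>)");

    // Hypothetical helper: true if the input contains a slash the pattern accepts.
    static boolean hasMatch(String input) {
        return SLASH.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMatch("<p />")); // false
        System.out.println(hasMatch("<p/>"));  // false
        System.out.println(hasMatch("a / b")); // true
    }
}
```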
Where you might run into problems is unusual whitespace inside the tag (the closing tag with trailing whitespace is still valid HTML). The following slashes WILL match with the above regex:
< p/>
</p >
This can be dealt with by adding more whitespace repetitions to your regex. As mentioned before, the pattern will also match slashes in text, so the following input will match only one slash (the one in the text):
<p>some text / other text</p>
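As a rough sketch of what "more whitespace repetitions" could look like, the version below tolerates whitespace around the p on both sides of the slash. The bound of 10 inside the lookbehind is an arbitrary assumption (bounded so the lookbehind has a finite length), and WhitespaceTolerantDemo and countMatches are names I made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceTolerantDemo {
    // Lookbehind: "<", optional whitespace, "p", optional whitespace (bounded).
    // Lookahead: optional whitespace, "p", optional whitespace, ">".
    static final Pattern SLASH =
            Pattern.compile("(?<!<\\s{0,10}p\\s{0,10})/(?!\\s*p\\s*>)");

    // Hypothetical helper: counts the slashes the pattern accepts.
    static int countMatches(String input) {
        Matcher m = SLASH.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("< p/>"));                          // 0
        System.out.println(countMatches("</p >"));                          // 0
        System.out.println(countMatches("<p>some text / other text</p>")); // 1
    }
}
```

This still treats every slash outside a p tag as fair game, including slashes in other tags, so it is only as reliable as the input is regular.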
Lastly, of course, there are CDATA sections. The following input will match NO slashes:
<![CDATA[This <p/> isn't actually a tag...it's just text.]]>