Regex to match any / in HTML except for <p> tag

Question

Basically I need to match any / in HTML that isn't part of a closed <p> tag. This is what I've got so far, but it doesn't work as expected, and I've been trying for some time now.

((?<!(p))\/(?!(>))) | ((?<!(<))\/(?!(p)))

I also need the regex to work in Java.

As an example:

<div>test</div> <span>test</span> <p>something<p/> </p>

I would like it to match every / except for the ones in the <p> tags at the end!

Refer to previous SO question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — fred02138, Oct 08 '13 at 13:45
Just to clarify: you only want to match "/" (a forward slash), and it must not be a slash that's closing an HTML tag? — string.Empty, Oct 08 '13 at 13:47
I would highly suggest using one of the many readily [available parsers for Java](http://goo.gl/Les6Qk). — Buggabill, Oct 08 '13 at 14:00

score 0 · Answer 1 · answered Oct 08 '13 at 13:59

/(?!p)

This seems to work, but I'm not sure what the question is.

<div>test</div> <span>test</span> <p>something<p/> </p>
matches:  /                /                    /

string.Empty

Not quite: this successfully removes the match for `</p>`, but not the match for `<p/>`. – Ethan Brown Oct 08 '13 at 14:00
It seems to me that the OP wants to match `<p/>` but not `</p>`. – sp00m Oct 08 '13 at 14:03
Read the OP's first line: `Basically i need to match any / from a HTML that isn't part of a closed <p> tag.` Therefore it must match all slashes except `</p>`. – string.Empty Oct 08 '13 at 14:05

Ethan Brown · Accepted Answer · 2013-10-08T14:16:43.973

Fortunately, Java supports both lookbehind and lookahead (in contrast, JavaScript, the language I spend most of my time in, supported only lookahead when this was written).

So the pattern you're looking for is:

(?<!<p)/(?!p>)

This pattern will match any slash that's neither preceded by <p nor followed by p>. Therefore it excludes the slash in <p/> as well as the one in </p>.

The lookahead/lookbehind assertions (often called "zero-width" assertions) are not actually included in the match, which sounds like what you want. They assert that the thing you are trying to match is preceded by (lookbehind) or followed by (lookahead) a sub-expression. In this case we're using negative assertions (not preceded by / not followed by).
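As a rough illustration, here is a minimal Java sketch that applies this pattern to the example string from the question and collects the index of each matched slash. The class and method names are my own and purely illustrative. Since the pattern contains no backslashes, it can go into a Java string literal unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashDemo {
    // A slash not preceded by "<p" and not followed by "p>".
    private static final Pattern SLASH = Pattern.compile("(?<!<p)/(?!p>)");

    // Returns the index of every slash the pattern matches.
    static List<Integer> slashPositions(String html) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = SLASH.matcher(html);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String html = "<div>test</div> <span>test</span> <p>something<p/> </p>";
        // Only the slashes in </div> and </span> are reported.
        System.out.println(slashPositions(html)); // prints [10, 27]
    }
}
```

Because the assertions are zero-width, each match is exactly one character long (the slash itself), so `m.start()` is all that is needed to locate it.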

Parsing HTML with regex is a tricky business. As one answer pointed out, HTML is context-free, and therefore cannot be completely parsed by regex, leaving open the possibility of HTML that will confound the match. Let's not even get started on ill-formed HTML.

I would consider the following common variation on an empty tag, though:

<p />

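To handle this, I would add some optional whitespace to the lookbehind. Java needs a lookbehind to have a bounded maximum length, so use a bounded repetition rather than \s*:

(?<!<p\s{0,10})/(?!p>)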
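A hedged Java sketch of the whitespace-tolerant variant follows. As far as I know, `java.util.regex` has traditionally rejected an unbounded quantifier such as `\s*` inside a lookbehind with a `PatternSyntaxException`, so this sketch assumes a bounded `\s{0,10}`. The bound of 10 and the names `WhitespaceSlashDemo` and `countSlashes` are my own choices, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceSlashDemo {
    // \s{0,10} stands in for \s*; the upper bound of 10 is an arbitrary assumption.
    // The backslash is doubled because this is a Java string literal.
    private static final Pattern SLASH =
            Pattern.compile("(?<!<p\\s{0,10})/(?!p>)");

    // Counts the slashes that the pattern matches in the input.
    static int countSlashes(String html) {
        Matcher m = SLASH.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSlashes("<p />"));  // 0: the <p /> slash is excluded
        System.out.println(countSlashes("<br />")); // 1: other empty tags still match
    }
}
```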

Where you might run into problems is weird whitespace that a lenient parser may still accept. The following slashes WILL match with the above regex:

< p/>
</ p>

This can be dealt with by adding more whitespace repetitions to your regex. Note that the pattern also matches slashes in text, so the following input will produce exactly one match (the slash in the text):

<p>some text / other text</p>

Lastly, of course, there are CDATA groups. The following input will match NO slashes, even though the <p/> inside it is plain text rather than a tag:

<![CDATA[This <p/> isn't actually a tag...it's just text.]]>
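These caveats are easy to check empirically. The sketch below runs a few of the problem inputs through the whitespace-tolerant pattern and prints where each one matches. It assumes a bounded `\s{0,10}` in the lookbehind, since Java generally needs a bounded lookbehind length. The bound and the names `CaveatDemo` and `positions` are illustrative assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaveatDemo {
    // Bounded-whitespace form of the pattern; the bound of 10 is an assumption.
    private static final Pattern SLASH =
            Pattern.compile("(?<!<p\\s{0,10})/(?!p>)");

    // Returns the index of every matched slash in the input.
    static List<Integer> positions(String html) {
        List<Integer> found = new ArrayList<>();
        Matcher m = SLASH.matcher(html);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "< p/>",                          // whitespace before the tag name
            "<p>some text / other text</p>",  // slash in ordinary text
            "<![CDATA[This <p/> isn't actually a tag...it's just text.]]>"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + positions(input));
        }
    }
}
```

The first two inputs each yield one match and the CDATA input yields none, which is the behaviour described above. If inputs like these matter in practice, a real HTML parser is probably the safer tool.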
