Regex get all matches including smaller submatches

Question

I have the following input string:

Testing <B><I>bold italic</I></B> text.

and the following regex:

<([A-Z][A-Z0-9]*)\b[^>]*>.*</\1>

This regex only gives the following larger match:

<B><I>bold italic</I></B>

How can I use a regex to get the smaller match?

<I>bold italic</I>

I tried using non-greedy operators, but that didn't work either.

Also, is it possible to get both as match groups, using something like Java or C# match groups or match collections?

Avinash Raj · Accepted Answer · 2014-07-15T05:59:01.950

1

Try the regex below, which uses a positive lookbehind:

(?<=>)<([A-Z][A-Z0-9]*)\b[^>]*>.*<\/\1>

It looks for a tag that starts immediately after a > symbol.

Explanation:

(?<=>) A positive lookbehind, which requires the match to start immediately after a > symbol.
< A literal < symbol.
([A-Z][A-Z0-9]*)\b[^>]*> Captures the tag name in group 1, then matches up to the next > symbol.
.* Matches any character except \n, zero or more times (greedy).
<\/\1> Matches the literal </, followed by the first captured group, followed by >.
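
To see the lookbehind in action, here is a minimal sketch in Python (my choice for illustration; the thread itself mentions Java and C#). The variable names are made up for the example, and the slash is left unescaped because Python patterns are not delimited by forward slashes.

```python
import re

# Input string from the question.
text = "Testing <B><I>bold italic</I></B> text."

# (?<=>) only lets a tag start right after a '>' character, so <B>
# (preceded by a space) is skipped and the match begins at <I>.
pattern = re.compile(r"(?<=>)<([A-Z][A-Z0-9]*)\b[^>]*>.*</\1>")

match = pattern.search(text)
inner = match.group(0)      # the whole matched element
tag_name = match.group(1)   # the captured tag name

print(inner)     # <I>bold italic</I>
print(tag_name)  # I
```

Note that this works for the sample input because the inner tag directly follows the outer one; an inner tag preceded by text would not satisfy the lookbehind.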

edited Jul 15 '14 at 05:59

answered Jul 15 '14 at 05:44

Avinash Raj

Is it possible to loop through both matches using a single regex, in any programming language? – user3839434 Jul 15 '14 at 06:44
Yep. Most programming languages support lookbehind. In Java, you need to escape the backslash one more time because the pattern is written inside a double-quoted string rather than between forward slashes. – Avinash Raj Jul 15 '14 at 06:46
I tried to get all matches in C# and it returned only the smaller one. Is it possible to get both matches one by one? If you can show me Java/C# code, it would be great. – user3839434 Jul 15 '14 at 06:55
Post the input at this link, http://regex101.com/r/wI6fK3/3, and save the regex. Then post the link back here. After that, explain what you want to match on that input, and then I'll show you the C# code. – Avinash Raj Jul 15 '14 at 06:57
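
For what it's worth, one common way to get both the larger and the smaller match from a single regex is to wrap the original pattern in a capturing lookahead, so that the engine consumes no characters and can try every starting position. This is not what either answer proposes; it is a sketch of a different technique, shown in Python with made-up variable names. The tag name moves to group 2 because the outer capture becomes group 1.

```python
import re

# Input string from the question.
text = "Testing <B><I>bold italic</I></B> text."

# The lookahead (?=...) matches without consuming input, so overlapping
# (nested) matches can each be reported. Group 1 holds the full element,
# group 2 holds the tag name used by the backreference \2.
pattern = re.compile(r"(?=(<([A-Z][A-Z0-9]*)\b[^>]*>.*</\2>))")

matches = [m.group(1) for m in pattern.finditer(text)]

for found in matches:
    print(found)
# <B><I>bold italic</I></B>
# <I>bold italic</I>
```

The same idea should carry over to Java and C#, since both support lookahead and backreferences, but I have only sketched it here in Python.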

score 1 · Answer 2 · answered Jul 15 '14 at 05:47


As you probably know, many people prefer using a DOM parser to parse HTML. But looking at your existing regex, to fix it, I would suggest this:

<([A-Z][A-Z0-9]*)\b[^<>]*>[^<]*</\1>

Explanation

Between the tags, instead of the .* that matches too many chars, we use [^<]*, which matches any chars that are not a <. That way we won't run into another tag.
Likewise, I changed your [^>]* to [^<>]* so we don't start another tag.
I assume you will make this case-insensitive.
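
A quick way to check this pattern is a small Python sketch (Python and the variable names are my assumptions for illustration; re.IGNORECASE stands in for the case-insensitive flag mentioned above).

```python
import re

# Input string from the question.
text = "Testing <B><I>bold italic</I></B> text."

# [^<]* cannot cross a '<', so the content between the opening and closing
# tag may not contain another tag. <B> therefore fails (it is followed by
# <I>), and only the innermost element matches.
pattern = re.compile(r"<([A-Z][A-Z0-9]*)\b[^<>]*>[^<]*</\1>", re.IGNORECASE)

matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # ['<I>bold italic</I>']
```

Because the content class excludes <, this version only ever returns elements that contain no nested tags.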

answered Jul 15 '14 at 05:47

zx81

FYI, added demo and explanation. :) – zx81 Jul 15 '14 at 05:50
Thanks, it's working. Btw, I am not doing this to parse HTML; I am doing it just to learn regex. – user3839434 Jul 15 '14 at 05:54
