
I have the following input string:

Testing <B><I>bold italic</I></B> text. 

and the following regex:

<([A-Z][A-Z0-9]*)\b[^>]*>.*</\1>

This regex only gives the following larger match:

<B><I>bold italic</I></B>

How can I use a regex to get the smaller match?

<I>bold italic</I>

I tried using non-greedy operators, but that didn't work either.

Also, is it possible to get both as match groups, using something like Java or C# match groups or match collections?

2 Answers


Try the regex below, which uses a positive lookbehind:

(?<=>)<([A-Z][A-Z0-9]*)\b[^>]*>.*<\/\1>

DEMO

It looks for a tag that starts immediately after a > symbol.

Explanation:

  • (?<=>) A positive lookbehind, which requires the match to start immediately after a > symbol.
  • < Literal < symbol.
  • ([A-Z][A-Z0-9]*) Captures the tag name; \b[^>]*> then matches up to the next > symbol.
  • .* Matches any character except \n, zero or more times.
  • <\/\1> Matches the literal </, followed by the text of the first captured group, followed by >.
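As a rough illustration of the explanation above, here is a minimal Java sketch that applies the pattern to the input from the question. The class name LookbehindTagDemo and the helper firstMatch are made up for this example. Note that in a Java string literal every regex backslash is doubled, and the / does not need escaping because Java patterns are not delimited by slashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original answer.
class LookbehindTagDemo {
    // The answer's pattern written as a Java string literal (backslashes doubled).
    static final Pattern INNER_TAG =
            Pattern.compile("(?<=>)<([A-Z][A-Z0-9]*)\\b[^>]*>.*</\\1>");

    // Returns the first match in the input, or null if nothing matches.
    static String firstMatch(String input) {
        Matcher m = INNER_TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The outer <B> is skipped because it is preceded by a space, not by ">".
        System.out.println(firstMatch("Testing <B><I>bold italic</I></B> text."));
    }
}
```

For the sample input this prints <I>bold italic</I>. Keep in mind that the lookbehind only finds a tag that directly follows a > symbol, so an inner tag preceded by text or whitespace would not be matched.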
Avinash Raj
  • Is it possible to loop through both matches using a single regex, in any programming language? – user3839434 Jul 15 '14 at 06:44
  • Yep. Most programming languages support lookbehind. In Java, you need to escape the backslash one more time because the pattern is written inside double quotes instead of between forward slashes. – Avinash Raj Jul 15 '14 at 06:46
  • I tried to get all matches in C# and it returned only the smaller one. Is it possible to get both matches one by one? If you can show me Java/C# code, that would be great. – user3839434 Jul 15 '14 at 06:55
  • Post the input at this link http://regex101.com/r/wI6fK3/3 and save the regex. Then post the link back here and explain what you want to match in that input. Then I'll show you the C# code. – Avinash Raj Jul 15 '14 at 06:57
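On the question raised in these comments about getting both matches one by one: one possible approach, which is not taken from either answer, is to wrap the whole tag pattern in a lookahead with a capturing group so that matches are allowed to overlap. The Java sketch below assumes that technique; the class name OverlappingTagMatches and the helper allMatches are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; this lookahead technique is an assumption, not the answer's method.
class OverlappingTagMatches {
    // The tag pattern sits inside a lookahead, so each match consumes no characters
    // and the scan can still find tags nested inside an earlier match.
    // Group 1 is the whole element, group 2 is the tag name.
    static final Pattern ANY_TAG =
            Pattern.compile("(?=(<([A-Z][A-Z0-9]*)\\b[^>]*>.*?</\\2>))");

    // Collects every element found, outer ones before the ones nested in them.
    static List<String> allMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ANY_TAG.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        for (String element : allMatches("Testing <B><I>bold italic</I></B> text.")) {
            System.out.println(element);
        }
    }
}
```

For the sample input this prints <B><I>bold italic</I></B> followed by <I>bold italic</I>. The sketch does not handle a tag nested inside another tag of the same name, which is one of the reasons a DOM parser is usually the safer choice for real HTML.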

As you probably know, many people prefer using a DOM parser to parse HTML. But looking at your existing regex, I would suggest this fix:

<([A-Z][A-Z0-9]*)\b[^<>]*>[^<]*</\1>

See the demo.

Explanation

  • Between the tags, instead of the .* that matches too many characters, we use [^<]*, which matches any characters other than <. That way we won't run into another tag.
  • Likewise, I changed your [^>]* to [^<>]* so that we don't start another tag.
  • I assume you will make this case-insensitive.
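To make the points above concrete, here is a small Java sketch that compiles the suggested pattern with the CASE_INSENSITIVE flag and collects every match. The class name InnermostTagDemo and the helper innermost are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original answer.
class InnermostTagDemo {
    // [^<]* between the tags stops at the next "<", so only elements
    // that contain no further tags can match.
    static final Pattern INNERMOST = Pattern.compile(
            "<([A-Z][A-Z0-9]*)\\b[^<>]*>[^<]*</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Returns all elements in the input that contain no nested tags.
    static List<String> innermost(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = INNERMOST.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(innermost("Testing <B><I>bold italic</I></B> text."));
    }
}
```

For the sample input this prints [<I>bold italic</I>]. The outer <B> element is never matched by this pattern, because its content contains a <.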
zx81