
This is my first post here, and I'm hoping to get some responses. I've read through a few similar posts, and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than what the other posts ask, so I'm giving it a shot.

I'm trying to find all nested tags. Here is an example I want to catch: <a><a></a></a>

I don't want to catch <a></a><a></a>

So, in plain English, I want to catch every <a> that follows another <a> without a </a> in between them. I also want to search through the entire string, so the match should proceed even when it encounters a newline or line break.

Hoping to have this problem solved. Thanks all!

Jerry
Gugg
    Welcome to Stack Overflow! Questions should show the poster's own efforts: what have you tried, what didn't work, what have you researched, and why does it not help you? In addition to that, as per the tag wiki, please always provide the programming language/environment you are using when asking regex questions. Regarding your actual question, even this "simple" problem can become arbitrarily complex: `More info about links`. Add XML comments to that and you see where this is going. Consider an XML/HTML parser. – Martin Ender Aug 10 '13 at 19:38

2 Answers


I hope you are ready for parsing XML with regex.


First of all, let's define what XML tags would look like!

<tag_name␣(optional space (then anything that doesn't end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>

To match one of these tags we can then use the following regex:

/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
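As a rough illustration, the single-tag pattern can be tried from Python's standard `re` module. This is only a sketch under assumptions: the possessive quantifiers (`++`, `*+`) are written as ordinary greedy ones so it also runs on Python versions before 3.11, the `/s` flag becomes `re.DOTALL`, and the names `TAG` and `first_tag` are made up for this example.

```python
import re

# Adaptation of the single-tag pattern for Python's stdlib `re`:
# possessive quantifiers are replaced by greedy ones, /s becomes re.DOTALL.
TAG = re.compile(
    r"<[^ />]+ ?/>"                    # self-closing: <name/> or <name />
    r"|<([^ >]+) ?[^>]*>.*?</ ?\1>",   # paired: <name ...> ... </name>
    re.DOTALL,
)

def first_tag(text):
    """Return the first tag-like substring found in text, or None."""
    m = TAG.search(text)
    return m.group(0) if m else None

print(first_tag("x <br /> y"))                     # <br />
print(first_tag('<p class="c">hello\nworld</p>'))  # the whole <p>...</p> span
print(first_tag("plain text"))                     # None
```

Note that the lazy `.*?` pairs an opening tag with the first matching closing tag, so on `<a><a></a></a>` this pattern alone stops at the inner `</a>`.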

Obviously, no tags are going to nest within our second type of XML tag. So our two-level nested regex would then be:

/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
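A quick way to check the two-level pattern against the examples from the question is sketched below in Python. As an assumption for portability, the possessive quantifiers are again written as greedy ones and `/s` becomes `re.DOTALL`; `NESTED` and `has_nested_tag` are hypothetical names used only here.

```python
import re

# Two-level pattern adapted for Python's stdlib `re`
# (greedy instead of possessive quantifiers, re.DOTALL instead of /s).
NESTED = re.compile(
    r"<([^ >]+) ?[^>]*>.*?"                          # outer opening tag
    r"(?:<([^ >]+) ?[^>]*>.*?</ ?\2>|<[^ />]+ ?/>)"  # one inner tag (paired or self-closing)
    r".*?</ ?\1>",                                   # outer closing tag
    re.DOTALL,
)

def has_nested_tag(text):
    """True if text contains a tag with at least one tag inside it."""
    return NESTED.search(text) is not None

print(has_nested_tag("<a><a></a></a>"))        # True
print(has_nested_tag("<a></a><a></a>"))        # False
print(has_nested_tag("<a>\n<a>x</a>\n</a>"))   # True, newlines are crossed
```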

Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):

/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s

Done - The regex should do.

No seriously, try it out.

I stole an XML file fragment from the w3schools XML tutorial and tried it with my regex; I also copied a Maven project .xml from aliteralmind's question and tried that as well. It works best with heavily nested elements.
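The recursive pattern needs an engine that supports `(?2)`-style recursion, which Python's standard `re` module does not. For readers in that situation, here is a sketch of a different technique, not the regex above: a small tag tokenizer combined with a depth counter. It assumes the input has no comments, CDATA sections, or `>` inside attribute values, and the names `TOKEN` and `nested_same_name` are invented for this example.

```python
import re

# Tokenizer for opening, closing, and self-closing tags.
# Assumption: no comments, CDATA, or ">" inside attribute values.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^>]*?(/?)>")

def nested_same_name(text, name="a"):
    """Return the offsets of every opening <name> found while another <name> is still open."""
    depth = 0
    hits = []
    for m in TOKEN.finditer(text):
        closing, tag, self_closing = m.group(1), m.group(2), m.group(3)
        if tag != name:
            continue
        if closing:
            depth = max(depth - 1, 0)   # tolerate stray closing tags
        else:
            if depth > 0:
                hits.append(m.start())  # opened inside another <name>
            if not self_closing:
                depth += 1
    return hits

print(nested_same_name("<a><a></a></a>"))              # [3]
print(nested_same_name("<a></a><a></a>"))              # []
print(nested_same_name("<a>\n<b><a>x</a></b>\n</a>"))  # [7]
```

Because the counter tracks depth rather than a fixed number of levels, elements nested three or more deep are reported as well.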

[Image; source: gyazo.com]

Cheers.

Glorfindel
Unihedron

If you want a 100% correct solution, for example one that works with arbitrary content in comments and CDATA sections and in internal/external entities, and with author-chosen namespace prefixes, then it can't be done with regular expressions.

And since a 100% correct solution is very easy to achieve with XSLT, I think you are using the wrong technology.

No doubt you can achieve an acceptably high hit rate with regular expressions if you're prepared to put enough work in, but the details depend on aspects of the specification that you haven't made clear: for example, what you want to do with the nested elements that you find, and whether you want to locate elements nested 3-deep or 4-deep as well as those nested 2-deep.
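To give a feel for the parser-based route, here is a minimal sketch. It uses Python's standard `xml.etree.ElementTree` rather than XSLT, so it is a stand-in for the approach and not the XSLT solution itself. It assumes the input is well-formed XML, and `nested_elements` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

def nested_elements(xml_text, name="a"):
    """Return every <name> element that has another <name> among its descendants."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed input
    outer = []
    for el in root.iter(name):
        # el.iter(name) yields el itself first, so exclude it from the check.
        if any(d is not el for d in el.iter(name)):
            outer.append(el)
    return outer

doc = "<root><a id='1'><a id='2'/></a><a id='3'/><a id='4'><!-- <a> --></a></root>"
print([el.get("id") for el in nested_elements(doc)])  # ['1']
```

The `<a>` inside the comment is ignored because the parser never treats comment text as an element, which is the kind of case a regex has to handle by hand.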

Michael Kay
  • Thank you for pointing this out. I also realized that a 100% solution is not possible with regex, but what about a 90% solution with regex? That is the only tool available to me at this time. – Gugg Aug 16 '13 at 17:16
  • @Gugg: But the solution is so much easier and simpler using XPath. And why are you sure that regex is the only tool available to you? XPath (v1 at least) is available from Perl, Ruby, Python, PHP, JavaScript, and from the Unix shell (the perl xpath script, based on the Perl XML::XPATH library, comes installed on Mac OS). What tools are available to you? – Steven D. Majewski Aug 16 '13 at 19:18