0

I have an XML file in the following format:

<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
..
..
..

I want to extract the name attribute of every starttag that has at least one innertag with value YYY.

So for the file above, the output should be AAA and CCC. I can only use regex matching. I suppose this is possible with lookaheads, but I have not been able to write a pattern that works across multiple lines. I know how to use regex on a single line and tried the same approach here, but I am not getting the expected output. Does anyone have any pointers?

Edit: Although I have used an XML example, what I am really trying to learn is multiline regex matching, and this is the file I am failing on. Please avoid XML-parsing solutions.

Update: As per Steven's suggestion, the following worked:

pcregrep -M '<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

grep -Pzo '<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml
Shashwat Kumar

2 Answers

1

Consider using XMLStarlet

"XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using simple set of shell commands in similar way it is done for plain text files using UNIX grep, sed, awk, diff, patch, join, etc commands."

neuhaus
0

An XML parser, especially one that supports XPath, is going to be far easier and more stable, but if you really must use regex, here's a pattern that works with the sample input you provided:

<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>

It's not going to work with all variations of well-formed XML documents, but as long as they are consistently formatted like your example, you should be "okay".
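As a rough illustration of how the pattern behaves, here is a minimal sketch in Python. Python's `re` module is my assumption, not something from the question (which used pcregrep and grep -P). The names `SAMPLE`, `PATTERN`, and `names_with_value_yyy` are made up for the example, and the capture group is written as `([^"]*)` so that it holds the whole name rather than only its last character.

```python
import re

# Sample input copied from the question.
SAMPLE = '''<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
'''

# Same structure as the pattern above, split over several lines for readability.
# No flags are needed: \s and the negated classes already match newlines.
PATTERN = re.compile(
    r'<starttag name="([^"]*)"[^>]*>'        # opening tag, capture the name
    r'(?:\s|<innertag[^>]*>)*'               # any innertags before the hit
    r'<innertag name="[^"]*" value="YYY"/>'  # the innertag we are looking for
    r'(?:\s|<innertag[^>]*>)*'               # any innertags after the hit
    r'</starttag>'
)


def names_with_value_yyy(text):
    """Return the name of every starttag block containing value="YYY"."""
    return [match.group(1) for match in PATTERN.finditer(text)]


print(names_with_value_yyy(SAMPLE))
```

Because the repeated group can only consume whitespace and complete innertag elements, a match cannot run past a `</starttag>` into the next block, which is what keeps BBB out of the result.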

A regex engine itself has no notion of lines: it matches against whatever string it is given, newlines included. Whether a match can span lines therefore depends mostly on the tool. grep, for example, feeds the pattern one line at a time unless told otherwise, which is what -M (pcregrep) and -z (grep) change. The only real trick in the pattern itself is that . does not match newline characters by default, so if you want to match any character, including newlines, you need (.|\n), the engine's dot-all option where it has one, or a negated character class such as [^>].
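The point about . and newlines can be checked with a small sketch. This again assumes Python's `re` module, where the dot-all option is spelled `re.DOTALL`; the toy input `text` is invented for the example.

```python
import re

# A tiny element whose content spans three lines.
text = '<a>\nX\n</a>'

# Plain dot stops at the first newline, so this finds nothing.
no_flag = re.search(r'<a>.*</a>', text)

# Three ways to let the match cross newlines.
with_dotall = re.search(r'<a>.*</a>', text, re.DOTALL)   # dot-all option
with_alternation = re.search(r'<a>(?:.|\n)*</a>', text)  # explicit .|\n
with_negated_class = re.search(r'<a>[^<]*</a>', text)    # negated class

print(no_flag)
print(with_dotall.group(0) == text)
print(with_alternation.group(0) == text)
print(with_negated_class.group(0) == text)
```

The negated class is usually the safest of the three for tag-like input, since it cannot run past the next delimiter the way a greedy dot can.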

Steven Doggart