Bash: Regex matching on multiple lines simultaneously and extracting captured content

Question

I have an XML file in the following format:

<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
..
..
..

I want to extract the name attribute of every starttag that contains at least one innertag whose value is YYY.

So for the file above, the output would be AAA and CCC. I can only use regex matching. I suppose it is possible using lookaheads, but I am not able to write regex patterns that span multiple lines. I know how to use regex on a single line and tried the same approach here, but I am not getting the expected output. Does anyone have a way forward on this?

Edit: Although I have used an XML example, I am actually trying to learn multiline regex matching, and this is the file I am trying it on and failing with. Please avoid XML-parsing solutions.

Update: As per Steven's suggestion, the following worked:

pcregrep -M '<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

grep -Pzo '<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml

Well, since that would be both easier and more stable, can you say why you don't want to? — Steven Doggart, Jan 28 '16 at 13:27
Writing a one-line regex is always easier than a big chunk of code where possible, and I will also get to learn multiline regex matching along the way. — Shashwat Kumar, Jan 28 '16 at 13:31

score 1 · Answer 1 · answered Jan 28 '16 at 13:31

Consider using XMLStarlet

"XMLStarlet is a set of command line utilities (tools) which can be used to transform, query, validate, and edit XML documents and files using simple set of shell commands in similar way it is done for plain text files using UNIX grep, sed, awk, diff, patch, join, etc commands."

score 0 · Accepted Answer · answered Jan 28 '16 at 13:35

An XML parser, especially one which supports XPath, is going to be far easier and more stable, but if you really must insist on using regex, here's a pattern that will work with the sample input that you provided:

<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"\/>(\s|<innertag[^>]*>)*<\/starttag>

It's not going to work with all variations of well-formed XML documents, but as long as they are consistently formatted like your example, you should be "okay".
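
The pattern above matches the whole starttag block rather than just the name. If only the names are wanted, one possible approach with GNU grep is to combine \K (which drops everything matched so far from the reported match) with a lookahead. The following is a minimal sketch, assuming GNU grep built with PCRE support; the temporary sample file and the variable names are made up for illustration and are not part of the original question.

```shell
#!/usr/bin/env bash
# Hypothetical sample file mirroring the structure shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<starttag name="AAA" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
<starttag name="BBB" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
</starttag>
<starttag name="CCC" >
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="XXX"/>
    <innertag name="XXX" value="YYY"/>
</starttag>
EOF

# -z      : read the whole file as one record, so \s can cross newlines
# -o      : print only the matched text (records are NUL-terminated with -z)
# \K      : discard everything matched before this point from the output
# (?=...) : require a YYY innertag inside the same block without consuming it
names=$(grep -Pzo '<starttag name="\K[^"]*(?="[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="YYY"/>)' "$sample" | tr '\0' '\n')

printf '%s\n' "$names"
rm -f "$sample"
```

On the sample from the question this should list AAA and CCC, one per line. Like the pattern above, it relies on the input being formatted consistently.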

By default, a regex engine matches across line breaks in whatever input it is given. It is line-oriented tools such as grep that hand the engine one line at a time, which is why pcregrep needs -M and GNU grep needs -z here. The only real trick is that the . pattern does not match new-line characters by default, so if you want to match any character, including new-lines, you need to use (.|\n), turn on dot-all mode with (?s), or use a negative character class such as [^>].
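
A quick way to see this behaviour is to test a two-line string against a few patterns. This is only a sketch, assuming GNU grep with -P support; the helper function matches and the variable names are hypothetical and chosen just for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: feeds the two-line string "a<newline>b" to grep as a
# single record (-z) and reports whether the given PCRE pattern matches (-q).
matches() { printf 'a\nb' | grep -Pzq "$1" && echo yes || echo no; }

plain=$(matches 'a.b')           # . does not match the newline by default
dotall=$(matches '(?s)a.b')      # (?s) lets . match the newline as well
alternation=$(matches 'a(.|\n)b') # explicit alternative for the newline
negated=$(matches 'a[^>]b')      # a negated class matches the newline too

printf 'a.b: %s\n(?s)a.b: %s\na(.|\\n)b: %s\na[^>]b: %s\n' \
  "$plain" "$dotall" "$alternation" "$negated"
```

Only the first pattern is expected to fail; the other three all allow the newline between a and b.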

`.` doesn't match new-line was the part I was ignorant of. Thanks for mentioning it. — Shashwat Kumar, Jan 28 '16 at 13:52
@ShashwatKumar yeah, it took me a long time before I realized that. Once I did, everything started making much more sense :) — Steven Doggart, Jan 28 '16 at 13:54
