The main challenge is that since awk only supports greedy matching, you can't write any variation of <foo>.*</foo> that will stop at the first </foo> on the line instead of the last </foo>. The solution is to convert each start and end string into a single character that cannot appear in the input, so you can write x[^xy]*y where x and y are those start/end characters. But how do you choose a character that can't appear in the input? You don't - you make one:
$ cat nonGreedy.awk
{
    $0 = encode($0)
    # Extract one {...} match at a time, then remove it from the record.
    while ( match($0,/({[^{}]*})/) ) {
        print decode(substr($0,RSTART,RLENGTH))
        $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    }
}

function encode(str) {
    gsub(/@/,"@A",str)                              # escape the escape character
    gsub(/{/,"@B",str); gsub(/}/,"@C",str)          # free up { and }
    gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)  # 1 character per start/end string
    return str
}

function decode(str) {
    # Undo the encode() steps in reverse order.
    gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
    gsub(/@C/,"}",str); gsub(/@B/,"{",str)
    gsub(/@A/,"@",str)
    return str
}

$ awk -f nonGreedy.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1 .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@ .. </foo>
<foo> .. 5 .. </foo>
The above works by picking any character that can't appear in the start/end strings (note that it doesn't have to be absent from the input as a whole, just from those strings), in this case I'm choosing @, and appending an A after each occurrence of it in the input. At this point every occurrence of @A represents an @ character, and the input is guaranteed to contain no @B, or @ followed by anything other than A.
Now we can pick 2 other characters to represent the start/end strings, in this case I'm choosing { and }, and convert their existing occurrences in the input to @-prefixed strings like @B and @C. At this point every occurrence of @B represents a { character, every @C represents a } character, and there are no {s or }s anywhere in the input.
Now all that's left to do to find the strings we want to extract is convert every start string <foo> to the start character we've chosen, {, and every end string </foo> to the end character, }. Then we can use a simple regexp of {[^{}]*} to represent a non-greedy version of <foo>.*</foo>.
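To see the three encoding steps in action, here is a minimal sketch that prints a record after each one. The sample record is made up for illustration (it is not taken from the input file used above), and I've written the braces as bracket expressions, [{] and [}], purely as a defensive choice so that no awk can mistake them for a repetition interval:

```shell
# Made-up sample record: a literal @, literal braces, and two <foo>...</foo> blocks.
line='x@{<foo>a</foo>}y<foo>b</foo>'

trace=$(printf '%s\n' "$line" | awk '{
    print "input : " $0
    gsub(/@/, "@A");                          print "step 1: " $0   # escape the escape character
    gsub(/[{]/, "@B"); gsub(/[}]/, "@C");     print "step 2: " $0   # free up { and }
    gsub(/<foo>/, "{"); gsub(/<\/foo>/, "}"); print "step 3: " $0   # one character per delimiter
}')
printf '%s\n' "$trace"
```

The last line printed is "step 3: x@A@B{a}@Cy{b}". The only braces left at that point are the ones standing in for <foo> and </foo>, so a pattern like [{][^{}]*[}] matches {a} and {b} separately rather than spanning both.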
As we find each string we unwind the conversions above, applying them to each matched string in exactly the reverse of the order they were applied to the whole record, so } goes back to </foo>, { goes back to <foo>, @B goes back to {, @A goes back to @, etc., and we have the original text for that string.
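A quick way to convince yourself that the order matters is to feed in a record that contains a literal @B, i.e. the same text the encoding uses internally for {. The sketch below is only an illustration: the sample record is made up, decode_wrong() is a hypothetical broken variant that undoes the steps first-to-last, and the braces are again written as [{] and [}] as a defensive choice:

```shell
# Made-up record whose payload is a literal "@B", the same text used internally for {.
result=$(printf '%s\n' '<foo>@B</foo>' | awk '
function encode(str) {
    gsub(/@/, "@A", str)
    gsub(/[{]/, "@B", str); gsub(/[}]/, "@C", str)
    gsub(/<foo>/, "{", str); gsub(/<\/foo>/, "}", str)
    return str
}
# Undo the steps last-to-first, as in the script above.
function decode(str) {
    gsub(/[}]/, "</foo>", str); gsub(/[{]/, "<foo>", str)
    gsub(/@C/, "}", str); gsub(/@B/, "{", str)
    gsub(/@A/, "@", str)
    return str
}
# Hypothetical broken variant: undoes the steps first-to-last instead.
function decode_wrong(str) {
    gsub(/@A/, "@", str)
    gsub(/@C/, "}", str); gsub(/@B/, "{", str)
    gsub(/[}]/, "</foo>", str); gsub(/[{]/, "<foo>", str)
    return str
}
{
    enc = encode($0)
    print "encoded    : " enc
    print "decoded    : " decode(enc)
    print "wrong order: " decode_wrong(enc)
}')
printf '%s\n' "$result"
```

The record encodes to {@AB}. Decoding in reverse order restores <foo>@B</foo>, while the wrong order turns @AB into @B first and then into {, producing <foo><foo></foo>.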
The above will work in any awk. If your start/end strings contain RE metacharacters then you'd have to escape those, or use a while(index(substr())) loop instead of gsub() to replace them.
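As a rough sketch of that index()/substr() alternative, a literal string replacement could look something like the following. The function name replace_literal() and the sample start string [*] are made up for illustration, and the function assumes the string being searched for is not empty:

```shell
out=$(printf '%s\n' 'a[*]b[*]c' | awk '
# Hypothetical helper: replace every literal occurrence of old with repl.
# Neither argument is treated as a regexp or as a gsub() replacement string.
function replace_literal(str, old, repl,    pos, res) {
    res = ""
    while ( (pos = index(str, old)) > 0 ) {
        res = res substr(str, 1, pos - 1) repl     # text before the match, then the replacement
        str = substr(str, pos + length(old))       # keep scanning after the match
    }
    return res str
}
{ print replace_literal($0, "[*]", "{") }')
printf '%s\n' "$out"
```

This prints a{b{c. Because index() compares strings rather than regexps, the [ * ] characters need no escaping, and the replacement text is used as-is, so a & in it would not be expanded the way it is by gsub().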
Note that if you do use gawk and the labels aren't nested then you can keep the 2 functions exactly as above and change the rest of the script to just:
BEGIN { FPAT="{[^{}]*}" }
{
    $0 = encode($0)
    for (i=1; i<=NF; i++) {
        print decode($i)
    }
}
Obviously you don't really need to put the encode/decode functionality in separate functions; I just separated it out here to make it explicit and distinct from the loop that uses it.
For another example of when/how to apply the above approach, see https://stackoverflow.com/a/40540160/1745001.