How do I use awk to extract data within nested delimiters using non-greedy regexps

Question

This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.

Given an input file like:

<foo> .. 1 <foo> .. a<2 .. </foo> .. </foo> <foo> .. @{<>}@ <foo> .. 4 .. </foo> .. </foo> <foo> .. 5 .. </foo>

how do you extract the text between the nested start (<foo>) and end (</foo>) delimiters using non-greedy matching with awk?

Desired output (in any order) is:

<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

Note that start or end could be any multi-character string and the text between them could be anything except those strings, including characters which are part of those strings such as the < or > characters in this example.

whats your motivation on using awk instead of perl with extended regex for this kind of task? — nopasara, Nov 09 '16 at 19:05
awk comes with all UNIX installations while perl doesn't, is perfectly suited for and an excellent choice for all general text manipulation tasks, and has a syntax that most people with any background in any Algol-based language (C, Ada, Pascal, Shell, etc.) will find familiar and easy to understand. — Ed Morton, Nov 09 '16 at 22:30

score 1 · Accepted Answer · edited May 23 '17 at 11:59

The main challenge is that since awk only supports greedy matching you can't write any variation of <foo>.*</foo> that will stop at the first </foo> on the line instead of the last </foo>. The solution is to convert each start and end string into a single character that cannot appear in the input so you can write x[^xy]*y where x and y are those start/end characters but how do you choose a character that can't appear in the input? You don't - you make one:

$ cat nonGreedy.awk
{
    $0 = encode($0)
    while ( match($0,/({[^{}]*})/) ) {
        print decode(substr($0,RSTART,RLENGTH))
        $0 = substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
    }
}
function encode(str) {
    gsub(/@/,"@A",str)
    gsub(/{/,"@B",str); gsub(/}/,"@C",str)
    gsub(/<foo>/,"{",str); gsub(/<\/foo>/,"}",str)
    return str
}
function decode(str) {
    gsub(/}/,"</foo>",str); gsub(/{/,"<foo>",str)
    gsub(/@C/,"}",str); gsub(/@B/,"{",str)
    gsub(/@A/,"@",str)
    return str
}

$ awk -f nonGreedy.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

The above works by you picking any character that can't appear JUST IN THE START/END STRINGS (note it doesn't have to be a character that can't appear in the input at all, just not in those strings), in this case I'm choosing @, and appending an A after each occurrence of it in the input. At this point every occurrence of @A represents an @ character and there are guaranteed to be no occurrences of @B or @ followed by anything else anywhere in the input.

Now we can pick 2 other characters that we want to use to represent the start/end strings, in this case I'm choosing { and }, and convert them to some @-prefixed strings like @B and @C and at this point every occurrence of @B represents a { character and @C represents a } character and there are no {s or }s anywhere in the input.

Now all that's left to do to find the strings we want to extract is convert every start string <foo> to the start character we've chosen, {, and every end string </foo> to the end character } and then we can use a simple regexp of {[^{}]*} to represent a non-greedy version of <foo>.*</foo>.

As we find each string we just unwind the conversions we did above in reverse order (note you must unwind the substitutions to each matched string in exactly the reverse order you applied them to the whole record) so { goes back to <foo> and @B goes back to {, and @A goes back to @, etc. and we have the original text for that string.

The above will work in any awk. If your start/end strings contain RE metacharacters then you'd have to escape those or use a while(index(substr())) loop instead of gsub() to replace them.

Note that if you do use gawk and the labels aren't nested then you can keep the 2 functions exactly as above and change the rest of the script to just:

BEGIN { FPAT="{[^{}]*}" }
{
    $0 = encode($0)
    for (i=1; i<=NF; i++) {
        print decode($i)
    }
}

Obviously you don't really need to put the encode/decode functionality in separate functions, I just separated that out here to make that functionality explicit and separate from the loop that uses it for clarity.

For another example of when/how to apply the above approach, see https://stackoverflow.com/a/40540160/1745001.

James Brown · Answer 2 · 2016-11-14T09:58:06.873

My (current version of) solution approaches the problem from the front so the output is not exactly the same:

<foo> .. 1                   # second
  <foo> .. a<2 .. </foo> ..  # first in my approach
</foo> 
<foo> .. @{<>}@              # fourth
  <foo> .. 4 .. </foo> ..    # third
</foo> 
<foo> .. 5 .. </foo>         # fifth

if the program would traverse the arrays arr and seps backwards, the output would be the same (probably), but I just ran out of time temporarily.

In Gnu awk (for using split with four params to parse the data).

EDIT For compatibility with others than Gnu awk I added function gsplit() which is a crude Gnu awk split replacement.

$ cat program.awk
{ data=data $0 }                         # append all records to one var
END {
    n=gsplit(data, arr, "</?foo>", seps) # split by every tag
    for(i=1;i<=n;i++) {                  # atm iterate arrays from front to back
        if(seps[i]=="<foo>")             # if element opening tag
            stack[++j]=seps[i] arr[i+1]  # store tag ang wait for closing tag
        else {
            stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
            print stack[j--] seps[i] 
        } 
        prev = seps[i]
    }
}

# elementary gnu awk split compatible replacement
function gsplit(str, arr, pat, seps,    i) {
    delete arr; delete seps; i=0
    while(match(str, pat)) {
        arr[++i]=substr(str,1,(RSTART-1))
        seps[i]=substr(str,RSTART,RLENGTH)
        str=substr(str,(RSTART+RLENGTH))
    }
    arr[++i]=substr(str,1)
    return i
}

Run it:

$ awk -f program.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1  .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@  .. </foo>
<foo> .. 5 .. </foo>

Sounds good. The main point of this question though was to show how to implement non-greedy regexps in awk, there's lots of ways to parse the text in my example without doing that but that seems to be something non-obvious and with a solution that works for any awk. — Ed Morton, Nov 09 '16 at 22:33
FYI since you're already using GNU awk you don't need to create a variable `data` to hold the entire contents of the file (but if you did, it'd have to include the currently missing RT values too), you can just set `RS='^$'` and then use `$0` in the `END` section. You could just do this one line at a time though, it's easy to extend any solution to treat the whole file as one record if/when the text between tags contains newlines. — Ed Morton, Nov 10 '16 at 15:56

How do I use awk to extract data within nested delimiters using non-greedy regexps

2 Answers2

Linked