In AWK (or GAWK), how to determine where the n-th word starts?

Question

Example: n=3, the input is

foo  bar    baz
a b c d e f g h
12  34 5 678

the output should be:

13
5
8

James Brown · Answer 1 · 2017-08-01T07:37:26.083

The simplest would probably be:

$ awk -v n=3 '{print index($0,$n)}' file
13
5
8

but it's error-prone and would require some checking. $n is the nth word, here the third (more precisely, the nth field as separated by FS, the field separator). index returns the character position where the first occurrence of that string begins, which is not necessarily the nth field itself. If FS is the default (runs of spaces, tabs and newlines), you'd probably want to search for the word with a leading space and add one to the position:

$ awk -v n=3 '{print 1 + index($0," " $n)}' file
13
5
8

... which, as pointed out in the comments, is also error-prone when n=1 or when the nth word matches the beginning of a prior word.
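To make the two failure modes concrete, here is a small sketch with made-up input lines (they are not from the question). In the first line the nth word also occurs inside an earlier word; in the second, the space-prefixed search hits the start of an earlier word:

```shell
# Made-up inputs chosen to trip each variant; the correct answers would be 4 and 6.
printf 'ab a b\n' | awk -v n=2 '{print index($0,$n)}'          # prints 1: "a" is found inside "ab"
printf 'x ab a\n' | awk -v n=3 '{print 1 + index($0," " $n)}'  # prints 3: " a" matches the start of "ab"
```

Both commands use only index, so they should behave the same in any awk.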

We could use the seps argument of GNU awk's split:

$ awk -v n=3 '{
    s=1                      # positions are 1-based
    split($0,a,/ +/,b)       # fields go to a, the separators between them to b
    for(i=1;i<n;i++)         # for each field before the nth
        s+=length(a[i] b[i]) # add the length of the field and its separator
    print s                  # position where the nth word starts
}' file
13
5
8

Alas, this is too error-prone if, for example, `$2==$n` — user31264, Aug 01 '17 at 07:25
@user31264 Sure is. This was just a start. :D — James Brown, Aug 01 '17 at 07:26

Tom Fenech · Answer 2 · 2017-08-01T08:03:51.050

You can use match to do this:

$ awk 'match($0, /[[:blank:]]*([^[:blank:]]+[[:blank:]]+){2}/) {
    print RLENGTH + 1 
}' file
13
5
8

Or using a parameter with a dynamic regex:

$ awk -v n=3 'match($0, "[[:blank:]]*([^[:blank:]]+[[:blank:]]+){" (n - 1) "}") {
    print RLENGTH + 1
}' file
13
5
8

This matches optional leading blanks (spaces or tabs), followed by n - 1 repetitions of a run of non-blanks and the run of blanks after it, where n is the word number. match sets the variables RSTART and RLENGTH (in this case, RSTART == 1). RLENGTH gives the length of the match, so one character after that is where the nth word starts.

Since you mentioned GNU awk, you can shorten things by using \s (which is actually [[:space:]], but that works here too) and non-space \S:

$ awk -v n=3 'match($0, "\\s*(\\S+\\s+){" (n - 1) "}") { print RLENGTH + 1 }' file

In a dynamic regex (one given as a string), the backslashes themselves need to be escaped.
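If the awk at hand lacks interval expressions (some older implementations do), the same RLENGTH idea can be sketched by building the regex in a loop instead. This is only an illustration, not part of the original answer: the names nth_word_start and nth_start are hypothetical, blanks are assumed to be spaces or tabs, and the regex additionally requires the first character of the nth word so that short lines report 0:

```shell
# Sketch only: nth_word_start and nth_start are hypothetical names, and
# blanks are assumed to be spaces or tabs.
nth_word_start() {
    awk -v n="$1" '
    function nth_start(line, n,    re, i) {
        re = "^[ \t]*"                    # optional leading blanks
        for (i = 1; i < n; i++)
            re = re "[^ \t]+[ \t]+"       # one earlier word plus the blanks after it
        re = re "[^ \t]"                  # first character of the nth word
        return match(line, re) ? RLENGTH : 0
    }
    { print nth_start($0, n) }'
}

printf 'foo  bar    baz\n  a b\n' | nth_word_start 3
```

For these two lines it prints 13 and then 0, the second line having only two words. Because the match is anchored at the start and ends on the first character of the nth word, RLENGTH itself is the position and no + 1 is needed.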

This is definitely the right approach when using the default FS. When using an FS that's a single character `X`, it'd be a tweak to change `[:blank:]` to `X` and remove the leading `X*`. When using an FS that's a multi-char regexp, though, it gets a whole lot more interesting :-)! — Ed Morton, Aug 01 '17 at 13:20

Ed Morton · Answer 3 · 2017-08-01T13:37:42.583

This will work for any field separator, including multi-char regexp, using GNU awk for the 4th arg to split():

$ cat tst.awk
{    
    split($0,flds,FS,seps)
    indent = 1
    for (i=0; i<n; i++) {
        indent += length(flds[i] seps[i])
    }
    print indent
}

$ awk -v n=3 -f tst.awk file
13
5
8

or with multi-char strings of .+. or .-. between fields:

$ cat file2
foo.+.bar.-.baz
a.+.b.-.c.+.d.-.e.+.f.-.g.+.h
12.-.34.+.5.-.678

$ awk -F'[.][+-][.]' -v n=3 -f tst.awk file2
13
9
11

Note that since we're using FS as an argument to split() it will be treated as a dynamic regexp (i.e. one stored in a string) and so any backslashes in the FS would need to be doubled.

Also note that we start the counting loop at 0, not 1, because with the default FS any leading white space before flds[1] (i.e. before $1) is stored in seps[0]. flds[0] will always be empty, and for a non-default FS seps[0] will also be empty, so no harm is done by including their lengths in all cases.
