
I know that the default FS is " ", so why am I seeing different results from the following awk commands? Please help me understand.

>echo "   ABC DEF   XYZ  \n   abc       def,ghi   xyz   \n" | awk '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 3 1:ABC line:    ABC DEF   XYZ  
nf: 3 1:abc line:    abc       def,ghi   xyz   
nf: 0 1: line: 
                                                                                                                                                               
>echo "   ABC DEF   XYZ  \n   abc       def,ghi   xyz   \n" | awk -F" " '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 3 1:ABC line:    ABC DEF   XYZ  
nf: 3 1:abc line:    abc       def,ghi   xyz   
nf: 0 1: line: 
                                                                                                                                                               
>echo "   ABC DEF   XYZ  \n   abc       def,ghi   xyz   \n" | awk -F"[ ]" '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 10 1: line:    ABC DEF   XYZ  
nf: 17 1: line:    abc       def,ghi   xyz   
nf: 0 1: line: 
                                                                                                                                                               
>echo "   ABC DEF   XYZ  \n   abc       def,ghi   xyz   \n" | awk -F"[ ]*" '{printf("nf: %s 1:%s line: %s\n", NF, $1, $0)}'
nf: 5 1: line:    ABC DEF   XYZ  
nf: 5 1: line:    abc       def,ghi   xyz   
nf: 0 1: line: 
                                         

I want to understand why there are no empty tokens in the 1st and 2nd examples, but there are in the 3rd and 4th examples.

Update: To explain my question further, awk seems to behave inconsistently between the default FS and a custom FS. See the examples below.

>printf "ab  cd\nef gh\n" | awk -F" " '{ printf("nf: %d\t", NF); for (i=1;i<=NF;i++) printf("%02d:%s\t", i, $i); print ""}'
nf: 2   01:ab   02:cd   
nf: 2   01:ef   02:gh

>printf "ab::cd\nef:gh\n" | awk -F":" '{ printf("nf: %d\t", NF); for (i=1;i<=NF;i++) printf("%02d:%s\t", i, $i); print ""}'
nf: 3   01:ab   02:     03:cd   
nf: 2   01:ef   02:gh
Mohan
  • In bash, `echo` does not interpret `\n` or other escape sequences by default. Use `echo -e` (or just use `printf` for everything). Your `printf()` in `awk` could be `printf("nf: %s 1:%s line: %s\n", NF, FNR, $0)` to show the record number (`FNR` is the record number within the current file, `NR` the total record number; they will be the same in your case). `$1` is fine if you want to print the first field of each record. – David C. Rankin Apr 20 '22 at 22:06
  • Also, `FS = " "` is the default (whitespace). You can omit it, and fields will be split on runs of space, tab, and newline characters. Of the separators you use, only `-F" "` mimics the default behavior. – David C. Rankin Apr 20 '22 at 22:38
  • If you are looking for examples, try `printf "a,b:c,d\ne,f:g,h\n" | awk -F, '{ printf "nf: %d\t", NF; for (i=1;i<=NF;i++) printf "%-8s", $i; print ""}'` then change `FS` to `-F:`. – David C. Rankin Apr 20 '22 at 22:48
  • I am trying to understand the output difference for default vs -F"[ ]" & -F"[ ]*" – Mohan Apr 20 '22 at 23:36
  • `-F"[ ]"` makes every *single space* a field separator. In `"   ABC DEF   XYZ  "` there are 9 spaces, hence 10 fields. In `-F"[ ]*"` the regular expression is *zero-or-more space* characters. In the first case, every space starts a new field (3 empty fields before `"ABC"`); in the second case each run of spaces counts once, so only the first and last fields are empty. I'd have to check to make sure this behavior would be consistent between `awk` versions. – David C. Rankin Apr 21 '22 at 00:45
  • Both -F" " and -F"[ ]" specify the field separator as a single space, so why do they produce different outputs? Also, how can I specify the delimiter as one-or-more of some characters? -F"[ ]+" did not work. – Mohan Apr 21 '22 at 16:50
  • The regex `*` repetition quantifier means *zero-or-more occurrences* (of space), so your `FS` in that case is empty or as many spaces as there are between fields (default). Actually I'm surprised `awk` doesn't puke on that one since from the zero-or-more standpoint the `FS` is ambiguously defined. – David C. Rankin Apr 21 '22 at 17:08
  • I understood about * (4th example). I was asking why 2nd & 3rd case produce different outputs where field-separator is just a single-space. – Mohan Apr 21 '22 at 17:12
  • Because in the second case you have `" "`, which is the default `FS` string, not a regex. So it matches runs of *one-or-more* whitespace characters as separators. In the third case you mandate *one-space-only* with the regex `"[ ]"`. Since the regex `[ ]` has no repetition quantifier, it is taken verbatim: every space is a separator. – David C. Rankin Apr 21 '22 at 17:15
  • Thank you. one last question, can we specify delimiter as one-or-more of some characters with [] ? , -F"[ ]+" did not work. – Mohan Apr 21 '22 at 17:24
  • Yes, `awk` understands *Extended Regex* syntax, but the bigger question is why do you want to do this? In all these cases it is simply better to not set `FS` and use the default. Now, you can specify multiple `FS` characters using a regex, and there are other circumstances where a regex `FS` is needed, but it is hard to see how messing with a space regex benefits you here unless you have a need to do it. – David C. Rankin Apr 21 '22 at 18:20
  • This was just an example, my FS is not space. – Mohan Apr 21 '22 at 18:50
  • Updated the question with more clear example. – Mohan Apr 21 '22 at 19:17
  • Post the first line of `awk --version` so I know what you are using. Read through [GNU Awk User's Guide - Specifying How Fields Are Separated](https://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html#Field-Separators). That explains why you see what you are seeing, the difference between string and regex separators, and the difference between whitespace and other values of `FS`. Keep the [GNU Awk User's Guide Contents](https://www.gnu.org/software/gawk/manual/html_node/index.html#SEC_Contents) handy; it is a very good reference for `gawk` and `awk`, showing how they differ as well. – David C. Rankin Apr 21 '22 at 20:35
  • >awk --version GNU Awk 4.1.0, API: 1.0 – Mohan Apr 21 '22 at 20:40
  • Perfect, the first two links in the page [GNU Awk User's Guide - Specifying How Fields Are Separated](https://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html#Field-Separators) explain exactly what is happening. When you use a space (default) `FS`, any number of spaces is considered one separator. When you use another character (e.g. `':'`), each `':'` is a separate field separator. That's why the regex `[ ]` is different from the default and splits on every space. – David C. Rankin Apr 21 '22 at 20:44
  • Thank you. How can I mark your comments as answer ? – Mohan Apr 21 '22 at 20:48
  • I'll write one up for you. Give me a few minutes `:)` – David C. Rankin Apr 21 '22 at 22:34

1 Answer


By default, awk uses a single space as FS. This is a special case: fields are separated by runs of whitespace (spaces, tabs, and newlines), and leading and trailing whitespace is ignored. So two or more spaces are not interpreted as multiple fields, but as a single separator. Using any other single character causes each occurrence of that character to be interpreted as a separator. So using ':' will interpret ":::my" as four fields (empty, empty, empty, "my"). See: GNU Awk User's Guide - 4.5.1 Whitespace Normally Separates Fields.
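
The contrast can be checked with a minimal sketch like the one below. It assumes a POSIX-style awk on the PATH; the variable names default_nf and colon_nf are made up for illustration.

```shell
# Default FS: runs of whitespace collapse, and the edges are ignored.
default_nf=$(printf '  ab    cd  \n' | awk '{ print NF }')

# FS=":" : every ':' is a separator, so ":::my" has three empty fields before "my".
colon_nf=$(printf ':::my\n' | awk -F ':' '{ print NF }')

echo "default FS: $default_nf fields, FS=':': $colon_nf fields"
```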

When you use a Regular Expression, each match of the expression (even a single space) is considered a separate field separator. See GNU Awk User's Guide - 4.5.2 Using Regular Expressions to Separate Fields.

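To examine every character as a separate field, you can set FS to the empty string (null), either on the command line with -v FS= or by setting FS = "" in a BEGIN block. (Written as -F"", the shell removes the empty quotes and awk sees a bare -F, which then takes the next argument as the separator.) This is an extension supported by gawk; POSIX leaves the behavior of an empty FS unspecified.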
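
As a rough illustration, assuming gawk or another awk that implements the empty-FS extension (the variable name chars is only for this example):

```shell
# FS set to the empty string via -v: each character becomes its own field.
# This relies on an extension; POSIX does not specify the result.
chars=$(printf 'abc\n' | awk -v FS= '{ print NF ": " $1 "," $2 "," $3 }')
echo "$chars"
```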

In your examples where you use the Regex -F"[ ]" each space is considered a separate field separator. FS is a Regex and not the default case. It is a Regex where the single character just happens to be a space.
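
A small sketch of that difference, reusing the first input line from the question (the variable names are illustrative only):

```shell
line='   ABC DEF   XYZ  '   # 9 spaces in total: 3 leading, 1, 3, and 2 trailing

# FS=" " is the default special case: whitespace runs collapse.
string_nf=$(printf '%s\n' "$line" | awk -F ' ' '{ print NF }')

# FS="[ ]" is a regex: each of the 9 spaces separates, giving 10 fields.
regex_nf=$(printf '%s\n' "$line" | awk -F '[ ]' '{ print NF }')

echo "FS=\" \": $string_nf fields, FS=\"[ ]\": $regex_nf fields"
```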

With the * (zero-or-more) quantifier, the FS is a bit ambiguous. It can match nothing (null) or it can match as many spaces as there are in a row. In the output above, each run of spaces was treated as one separator, and because a regex FS does not skip leading and trailing separators, the first and last fields are empty (hence nf: 5). I do not recommend messing with spaces and FS in this manner.

awk understands Extended Regular Expression (ERE) syntax, so you can use the '+' repetition specifier for one-or-more occurrences of the character.
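
For instance, a minimal sketch with ':' as the separator (variable names are illustrative; any awk with ERE support for FS is assumed):

```shell
# A single ':' separates at every occurrence: "ab", "", "cd".
one_nf=$(printf 'ab::cd\n' | awk -F ':' '{ print NF }')

# ':+' treats a run of colons as one separator: "ab", "cd".
plus_nf=$(printf 'ab::cd\n' | awk -F ':+' '{ print NF }')

# Unlike the default FS, a regex does not trim the edges:
# leading spaces still produce an empty first field ("", "ab", "cd").
edge_nf=$(printf '  ab  cd\n' | awk -F '[ ]+' '{ print NF }')

echo "':' -> $one_nf, ':+' -> $plus_nf, '[ ]+' with leading spaces -> $edge_nf"
```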

Keep the GNU Awk User's Guide handy. It is a good reference for gawk as well as the other flavors of awk. If something is unique to gawk, the guide marks it with a '#' to tell you. It usually explains (sometimes in a footnote) how the gawk behavior differs from POSIX awk, mawk, etc.

David C. Rankin