7

If the field separator is the empty string, each character becomes a separate field:

$ echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
5,h,e,l,l,o

However, if FS is a regex that can possibly match zero times, the same behaviour does not occur:

$ echo hello | awk -F ' *' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello

Anyone know why that is? I could not find anything in the gawk manual. Is FS="" just a special case?

I'm most interested in understanding why the 2nd case does not split the record into more fields. It's as if awk is treating FS=" *" like FS=" +"

glenn jackman
  • FWIW - on Mac OSX, your first example prints out `1,hello` - with a warning `awk: field separator FS is empty`. As answers mentioned, this is undefined behavior. Also note that `*` is not a regular expression - it is just the character `*`. To use a regex you would need something like `.*` - you will get "everything". – Floris Feb 27 '14 at 15:41
  • `" *"` *is* a valid regular expression that matches zero or more spaces. – glenn jackman Feb 27 '14 at 15:49
  • @glennjackman I don't think the current posts answered your question. I tested a little with awk's `split()` and `match()` functions; same result. So I guess we have to read awk's regex matching code to understand how gawk handles a match with `start=0, length=0`. Very likely (I didn't read the code yet) awk treats it as no match, so the whole string/line ends up as one field. `*` is a regex; in fact, in your question it is the same as `echo hello|awk -F 'm*' ...`. Interesting question anyway. – Kent Feb 27 '14 at 15:52
  • @glennjackman I apologize, I didn't notice the space. The conclusion seems to be that "awk regex cannot match zero characters"... but I have no good source for this (other than, like you, "observation"). – Floris Feb 27 '14 at 16:08
  • If you asked this question at the comp.lang.awk newsgroup, Arnold Robbins who provides/maintains gawk and wrote the most useful awk book around ("Effective Awk Programming") would almost certainly answer you there with chapter and verse. Just sayin'.... – Ed Morton Mar 18 '14 at 19:28

5 Answers

4

Interesting question!

I just pulled the gawk 4.1.0 source, and I think the answer can be found in the file field.c.

line 371:
 * re_parse_field --- parse fields using a regexp.
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a regular
 * expression -- either user-defined or because RS=="" and FS==" "
 */
static long
re_parse_field(lo...

and also this line (line 425):

if (REEND(rp, scan) == RESTART(rp, scan)) {   /* null match */

This is the case that the <space>* in your question hits. On a null match the implementation does not increment nf, so it treats the whole line as one single field. Note that the do_split() function uses this function too.
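The effect of that null-match branch can be mimicked in plain awk. This is only a sketch of the behaviour described above, not gawk's actual code: `resplit_count`, `resplit` and the `sep` variable are made-up names, and the sketch assumes that on a zero-length match the scanner simply moves forward one character without starting a new field.

```shell
# resplit_count: hypothetical helper. Reads lines on stdin, splits each one on
# the regex given as $1, and prints the field count followed by the fields.
resplit_count() {
  awk -v sep="$1" '
    # Split s into arr[] on regex re, ignoring zero-length matches
    # (assumed to mirror the null-match branch quoted above).
    function resplit(s, arr, re,    n, field) {
        n = 0
        field = ""
        while (length(s) > 0 && match(s, re)) {
            if (RLENGTH == 0) {
                # Null match: keep one character in the current field and rescan.
                field = field substr(s, 1, 1)
                s = substr(s, 2)
                continue
            }
            # Real match: the text before the separator completes the field.
            arr[++n] = field substr(s, 1, RSTART - 1)
            s = substr(s, RSTART + RLENGTH)
            field = ""
        }
        arr[++n] = field s    # whatever is left is the last field
        return n
    }
    {
        n = resplit($0, a, sep)
        out = n
        for (i = 1; i <= n; i++)
            out = out "," a[i]
        print out
    }'
}
```

With this sketch, `echo hello | resplit_count ' *'` prints `1,hello`, because every match is empty and none of them ends a field, while `echo 'a  b c' | resplit_count ' *'` prints `3,a,b,c`, because only the non-empty runs of spaces act as separators.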

For comparison, if FS is the null string, gawk puts each character into its own field. The gawk documentation states this clearly, and we can see it in the code as well:

line 613:
 * null_parse_field --- each character is a separate field
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is the null string.
 */
static long
null_parse_field(long up_to,

If FS is a single character, awk does not treat it as a regex. The documentation mentions this too, and so does the code:

line 667:
 * sc_parse_field --- single character field separator
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a single character
 * other than space.
 */
static long
sc_parse_field(l

If we read the function, it does no regex matching at all.
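A quick way to see the literal treatment of a single-character FS is to use `.`, which would match any character if it were read as a regex. The helper name `count_fields` below is made up for illustration.

```shell
# count_fields: hypothetical helper. Prints NF for each input line,
# using $1 as the field separator.
count_fields() {
  awk -F "$1" '{ print NF }'
}

# A lone "." is a literal dot, so only real dots separate fields.
printf 'a.b.c\n' | count_fields '.'   # three fields: a, b, c
printf 'hello\n' | count_fields '.'   # no dot in the input, so one field
```

If `.` were interpreted as a regular expression, the second command would not report a single field, since every character of `hello` would count as a separator.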

The comments on re_parse_field() and sc_parse_field() show that do_split() invokes them too. That explains why the following command prints 1 instead of 3:

kent$  echo "foo"|awk '{split($0,a,/ */);print length(a)}'
1

Note: to keep the post from getting too long, I didn't paste the complete code here. The full source is available at:

http://git.savannah.gnu.org/cgit/gawk.git/

Kent
  • Thanks for digging that out. I see that RESTART would be equal to REEND, as that regex matches the empty string at the beginning of the string. So the answer to the question is "gawk works like that because that's how it's implemented." – glenn jackman Feb 27 '14 at 16:46
  • @glennjackman I think it is reasonable. If we write a regex `FS` but a line doesn't match it, what do we expect? I think in most cases we expect `NF==1` instead of `NF==300`. The only place we should be careful is when we use `split()`, where the same rule applies as for field handling. If we really want to split a string on zero-length matches, then we have to use `""` – Kent Feb 27 '14 at 16:55
  • If it doesn't match, the `research()` function would return -1. The issue is what to do when it does match, but the matched text has zero length. gawk developers chose to not increment NF, perl developers made the opposite decision. – glenn jackman Feb 27 '14 at 17:07
  • @glennjackman yes you are right. that is what I meant. my "doesn't match" meant actually zero length match. `x*x*x*` case, not `^foo$` I didn't describe it clearly. – Kent Feb 27 '14 at 20:33
2

As was mentioned, the behavior of an empty field separator is undefined, so the same code gives different results on different platforms and flavors of awk. For example (all on Mac OS X 10.8.5):

> echo hello | awk -F '' -v OFS=, '{$1 = NF OFS $1} 1'
awk: field separator FS is empty

1,hello

So awk complains, but keeps going.

Let's look at some other examples:

> echo hello | awk -F '.' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello

A . by itself is not treated as a regular expression; as a single-character FS it only matches a literal dot

> echo hello | awk -F '[.]' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello

Still nothing

> echo hello | awk -F '.?' -v OFS=, '{$1 = NF OFS $1} 1'
6,,,,,,

Now we have something like a regex: .? is "zero or one character". It is expanded to one character (which is consumed), so the output is "a whole lot of nothings"

> echo hello | awk -F '*' -v OFS=, '{$1 = NF OFS $1} 1'
1,hello

Not a regular expression

> echo hello | awk -F '.*' -v OFS=, '{$1 = NF OFS $1} 1'
2,,

A regular expression that consumes the entire thing

> echo hello | awk -F 'l' -v OFS=, '{$1 = NF OFS $1} 1'
3,he,,o

Match the letter l twice - one empty field between the two matches

> echo hello | awk -F 'ell' -v OFS=, '{$1 = NF OFS $1} 1'
2,h,o

Match all of ell at once

> echo hello | awk -F '.?|' -v OFS=, '{$1 = NF OFS $1} 1'
awk: illegal primary in regular expression .?| at 
 input record number 1, file 
 source line number 1

Attempt to be clever: sometimes an | with empty string on one side will match "anything" but awk's regex engine doesn't like it.

Conclusion - a regular expression used as FS never splits on an "empty" match, and whatever is matched is consumed. Attempts to use (?:.) or even (?=.) generate errors.

Floris
1

It seems to be a special case in gawk.

Traditionally, the behavior of FS equal to "" was not defined. In this case, most versions of Unix awk simply treat the entire record as only having one field. (d.c.) In compatibility mode (see Options), if FS is the null string, then gawk also behaves this way.

Mike Sherrill 'Cat Recall'
1

What POSIX has to say about this:

If FS is a null string, the behavior is unspecified.

So the gawk behaviour is implementation-specific and sort of explains why your two examples don't yield the same output.
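Because the result of an empty FS is unspecified, a script that needs one "field" per character on any awk can avoid FS altogether and walk the record with substr(). This is a minimal sketch; the helper name `split_chars` is made up, and what counts as a character (bytes or multibyte characters) depends on the awk implementation and locale.

```shell
# split_chars: hypothetical helper. For each input line, prints the number of
# characters followed by each character, comma-separated, without relying on FS="".
split_chars() {
  awk '{
    n = length($0)
    out = n
    for (i = 1; i <= n; i++)
        out = out "," substr($0, i, 1)   # append one character at a time
    print out
  }'
}
```

For example, `echo hello | split_chars` prints `5,h,e,l,l,o`, the same output gawk gives for the first command in the question.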

Adrian Frühwirth
0

Another data point: gawk and perl disagree on how to do this:

$ perl -E '$,=","; $s="hello"; $r=qr( *); @s=split($r,$s); say scalar(@s), @s'
5,h,e,l,l,o

$ gawk 'BEGIN {s="hello";r=" *";n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
1 hello
match
$ gawk 'BEGIN {s="hello";r="";  n=split(s,a,r); print n,a[n]; if (s~r) print "match"}'
5 o
match
glenn jackman
  • I just posted an answer; I think it explains why awk behaves like that. btw, this is not an answer to your question. :) – Kent Feb 27 '14 at 16:31