awk regex start of line anchor matches whitespace

Question

While parsing an input file with awk, I ran into an issue with anchors.

Given the following file:

 2015
2015
test
 test

Output with awk

$ awk '$1 ~ /^[0-9]/' file
 2015
2015

Output with sed

$ sed -n '/^[0-9]/p' file
2015

Can somebody explain the behaviour I'm seeing in awk?

Seen with

CentOS 7, GNU bash 4.2.46, GNU Awk 4.0.2
AIX 7, GNU bash 4.3.30, awk (default version in AIX), and gawk 4.0.2

As pointed out, the regex is matched against the given string, which in my example is the first field. With the default FS, leading whitespace is skipped, so the first field starts at the first non-whitespace character. – sastorsl Jun 05 '15 at 18:04
FYI there is no `start of line anchor` for regexps. There are start and end of string anchors (`^` and `$`) and those often get confused as meaning start/end of line since some tools (e.g. sed and grep) process one line at a time by default. In this case you're asking awk to find a digit at the start of the string contained in `$1` and so it's doing that. – Ed Morton Jun 05 '15 at 23:50
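
The point about string anchors can be checked with a field other than the first. This is a minimal sketch; the input line `abc 123` is made up for illustration and does not come from the question.

```shell
# ^ anchors to the start of whatever string is being matched, here field 2.
# The line is printed because $2 is "123", even though the digits are mid-line.
printf 'abc 123\n' | awk '$2 ~ /^[0-9]/'

# Against the whole record the same regex fails, because $0 starts with "a".
# Nothing is printed.
printf 'abc 123\n' | awk '$0 ~ /^[0-9]/'
```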

score 5 · Accepted Answer · answered Jun 05 '15 at 17:59

You will understand the difference with this awk command:

$ awk '/^[0-9]/' file
2015

Now awk operates on the full line, like sed, not just on the first field.

$1 ~ /^[0-9]/ compares only the first field. Whitespace is the default field separator in awk, and leading whitespace is skipped during field splitting, so the first field is 2015 on both lines regardless of the spaces before it.
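
To see the difference directly, the following sketch recreates the file from the question and prints each record next to its first field. The file name `sample.txt` is an arbitrary choice for this example.

```shell
# Recreate the sample file from the question ("sample.txt" is an arbitrary name).
printf ' 2015\n2015\ntest\n test\n' > sample.txt

# Show each record next to its first field; the brackets make whitespace visible.
awk '{ printf "$0=[%s] $1=[%s]\n", $0, $1 }' sample.txt

# Anchored against $1: leading blanks are already gone, so both 2015 lines match.
awk '$1 ~ /^[0-9]/' sample.txt

# Anchored against $0: only the line that really starts with a digit matches.
awk '$0 ~ /^[0-9]/' sample.txt
```

The first awk command should make it visible that `$1` never contains the leading space, which is why the anchor behaves differently from sed.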

answered Jun 05 '15 at 17:59

anubhava

I just caught my own error, about 5 seconds after posting. `awk '$0 ~ /^[0-9]/'` will be a more explicit way to express what you just wrote. I was matching the first field. Ashamed, now... – sastorsl Jun 05 '15 at 18:01
Yes indeed, `/^[0-9]/` is a shortcut for `$0 ~ /^[0-9]/` – anubhava Jun 05 '15 at 18:01
@sastorsl, in addition to what anubhava posted about whitespace as a delimiter, it's worth noting that whitespace at the beginning of the line, before the first non-whitespace character, is *not* treated as a delimiter (which would make `$1==""`, which is not the case). – ghoti Jun 05 '15 at 18:10
@ghoti, exactly. Which is interesting if one compares with `echo " ;x;y" | awk -F\; '{ print "XX" $1 "XX" }'` - giving "XX XX" – sastorsl Jun 05 '15 at 18:13
@sastorsl it's not that interesting, it's just you telling awk what to do and awk doing it. When you set `FS=" "` (which is the default value) you are TELLING awk during field splitting to ignore leading and trailing white space from each record and treat all chains of contiguous white space as field separators. Every other value of FS is taken at face value. If you want a literal blank char as the FS then you need to write `FS="[ ]"`. It's awk fundamentals. I recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins. – Ed Morton Jun 05 '15 at 23:14
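
The three field-splitting cases mentioned in the comments can be compared side by side. This is a small sketch using made-up one-line inputs; it assumes a POSIX-compatible awk such as gawk.

```shell
# Default FS (a single space): leading blanks are ignored, so $1 is "2015" and NF is 1.
printf ' 2015\n' | awk '{ printf "NF=%d $1=[%s]\n", NF, $1 }'

# FS as the regex [ ]: every blank is a separator, so $1 is empty and $2 is "2015".
printf ' 2015\n' | awk -F'[ ]' '{ printf "NF=%d $1=[%s] $2=[%s]\n", NF, $1, $2 }'

# FS=";": taken at face value, so $1 is the single blank before the first semicolon.
printf ' ;x;y\n' | awk -F';' '{ printf "NF=%d $1=[%s]\n", NF, $1 }'
```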

score 4 · Answer 2 · answered Jun 05 '15 at 17:59

The problem is that you are matching against the first field only.

Instead, use awk '/^[0-9]/' file, which matches against the whole line.

To be more explicit:

awk '$0 ~ /^[0-9]/' file

does the same thing, since $0 is the whole line.

answered Jun 05 '15 at 17:59

bkmoney