How to use multiple passes with gawk?

Question

I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?

Here's my .awk file:

pass == 1
{
    print "pass1 is", pass;  
}    

pass == 2
{
if(pass == 2)
    print "pass2 is", pass;  
}

Here's my output (input file is just "hello"):

hello
pass1 is 1
pass1 is 2
hello
pass2 is 2

Here's my command line:

gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt

I'd appreciate any help.

FYI GNU awk has a variable named `ARGIND` that makes your `pass` variables redundant. — Ed Morton, Dec 09 '15 at 00:31

F. Knorr · Accepted Answer · 2015-12-09T16:25:41.903

8

A (g)awk solution might look like this:

awk 'FNR == NR{print "1st pass"; next}
     {print "second pass"}' x.txt x.txt

(Replace awk with gawk if necessary.)
Say you wanted to find the maximum value in the first column of file x.txt and then print all lines that have this value in the first column. Your program might look like this (thanks to Ed Morton for the tip, see the comments):

awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
           $1==max'  x.txt x.txt

The output for x.txt:

6,5
2,6
5,7
6,9

is

6,5
6,9

How does this work? The variable NR keeps increasing with every record across all input files, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true while the first file is being processed (provided that file is not empty).
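
To make the NR/FNR behaviour visible, here is a small sketch that reads the same file twice and labels each record with the pass it belongs to. The temporary file and the `trace` variable are only there for illustration; the data is the x.txt sample from above.

```shell
# Hypothetical scratch file holding the same data as x.txt above
tmp=$(mktemp)
printf '6,5\n2,6\n5,7\n6,9\n' > "$tmp"

# For every record, print NR, FNR and which pass FNR == NR assigns it to
trace=$(awk '{ print NR, FNR, (FNR == NR ? "pass 1" : "pass 2") }' "$tmp" "$tmp")
printf '%s\n' "$trace"

rm -f "$tmp"
```

The first four lines show NR and FNR counting together (pass 1). From the fifth record on, FNR restarts at 1 while NR carries on, so the test is false (pass 2).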

edited Dec 09 '15 at 16:25

answered Dec 08 '15 at 17:52

F. Knorr


@MarkSetchell: You are right, the parentheses are not necessary. Therefore, I have updated my answer. Yet, for people like me who are used to Java / C ... the parentheses holding the condition are somewhat more familiar `(condition){code block}`. – F. Knorr Dec 08 '15 at 21:35

There's nothing gawk-specific in that script. To avoid requiring max to be >= 0 and to make your script portable to all awks (some awks will fail in some situations with unparenthesized ternary expressions) and easier to read, change your test to `FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}`. Any time you do a min or max calculation, seed with the first value read, don't assume/seed with some random value like zero. You can and should remove the `{print $0}` as that's the default action when a condition is true. – Ed Morton Dec 09 '15 at 00:21
@EdMorton: Thanks for the remarks. I have modified my answer accordingly (also giving you credit for it) – F. Knorr Dec 09 '15 at 16:26
Thanks for your help. I figured it out. See my answer below. – Steve Kolokowsky Dec 09 '15 at 23:14

ghoti · Answer 2 · 2015-12-09T13:18:27.763

4

So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.

But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)

awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile

Or, spaced out for easier reading:

BEGIN {
  FS=","
}

$1 > max {
  delete list           # empty the array
  n=0                   # reset the array counter
  max=$1                # set a new max
}

max==$1 {
  list[++n]=$0          # record the line in our array
}

END {
  for(i=1;i<=n;i++) {   # print the array in order of found lines.
    print list[i]
  }
}

With the same input data that F.Knorr tested with, I get the same results.

The idea here is that we go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.

This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on IO.
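
One caveat worth sketching: with an unseeded `max`, the test `$1 > max` compares against an uninitialized variable, which behaves like zero in a numeric comparison, so input whose first column is entirely negative should print nothing. The variant below applies the "seed with the first value read" advice from the comments on the accepted answer. The sample data, the temporary file and the `result` variable are made up for illustration.

```shell
# Hypothetical all-negative sample data
tmp=$(mktemp)
printf '%s\n' -6,5 -2,6 -5,7 -2,9 > "$tmp"

result=$(awk -F, '
NR == 1 || $1 > max {     # first record seeds max; later ones must exceed it
    delete list           # empty the array
    n = 0                 # reset the array counter
    max = $1              # set a new max
}
$1 == max {
    list[++n] = $0        # record the line in our array
}
END {
    for (i = 1; i <= n; i++) print list[i]
}' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

The only change from the spaced-out script above is the `NR == 1 ||` in the first pattern; the rest is the same single-pass logic.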

edited Dec 09 '15 at 13:18

answered Dec 08 '15 at 22:01

ghoti


Nice but by convention and as created by every awk function, awk arrays (and string char positions, and field numbers) start at one, not 0 so just tweak to `list[++n]=$0 .... for(i=1;i – Ed Morton Dec 09 '15 at 00:28

Thanks very much for pointing that out. In all my years of awk, that had never occurred to me. I've adjusted this answer, but .. now I have to comb through a small collection of other scripts to make similar adjustments. Whee! :-P – ghoti Dec 09 '15 at 01:04

@EdMorton - oh, and the condition in the for loop in END also needed to be changed to `i<=n`. – ghoti Dec 09 '15 at 13:19
Yeah I spotted that when I read your first comment above but by then it was too late to edit my comment and I figured you'd spot it right away too. – Ed Morton Dec 09 '15 at 13:49
@EdMorton, sorry to keep bringing you back to this, but I note that the [gawk docs](https://www.gnu.org/software/gawk/manual/html_node/Array-Intro.html) state that "*Usually, an index in the array must be a nonnegative integer. For example, the index zero specifies the first element in the array.*" This appears to be a direct transcription of the Intro to Arrays in "nawk" docs. Do you have any references for the convention of starting arrays at 1? – ghoti Dec 09 '15 at 15:26
You cropped the important leading and trailing parts of the text surrounding that quote, which are `"In most other languages..."` and `"....Arrays in awk are different"`. I'll see if I can google a reference about the convention of starting numeric indices at 1 in awk. FWIW I found `Awk arrays can only have one dimension; the first index is 1` in https://en.wikibooks.org/wiki/An_Awk_Primer/Arrays. – Ed Morton Dec 09 '15 at 20:11
Also: http://www.ibm.com/developerworks/library/l-awk2/ `under awk, it's customary to start array indices at 1, rather than 0` – Ed Morton Dec 09 '15 at 20:18
Oddly negative way of stating that starting at 1 is the convention but `Unlike most awk arrays, ARGV is indexed from 0` (from the awk book http://www.gnu.org/software/gawk/manual/gawk.html#Auto_002dset) – Ed Morton Dec 09 '15 at 20:27

score 0 · Answer 3 · answered Dec 09 '15 at 23:19

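The issue here is that newlines matter to awk: the opening brace of an action has to be on the same line as its pattern. Otherwise awk sees a pattern with no action (which prints the record by default), followed by an action with no pattern (which runs for every record).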

# This does what I should have done:
pass==1 {print "pass1 is", pass}
pass==2 {print "pass2 is", pass}

# This is the code in my question:
# When pass == 1, print the record (a pattern with no action defaults to print)
pass==1
# For every record, do this
    {print "pass1 is", pass;}
# When pass == 2, print the record (default action again)
pass==2
# For every record, do this
    {if (pass==2) print "pass2 is", pass;}

Using pass==1, pass==2 isn't as elegant, but it works.
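
As a rough check of the explanation above, this sketch runs both layouts against a one-line file, using the same `pass=1 file pass=2 file` command line as in the question. The temporary file and the `fixed` and `broken` variable names are made up for the example.

```shell
# Hypothetical one-line input, like the "hello" file in the question
tmp=$(mktemp)
echo hello > "$tmp"

# Brace on the same line as the pattern: each action runs only in its own pass
fixed=$(awk 'pass == 1 { print "pass1 is", pass }
pass == 2 { print "pass2 is", pass }' pass=1 "$tmp" pass=2 "$tmp")

# Brace on the next line: the bare pattern prints the record,
# and the bare action runs for every record in both passes
broken=$(awk 'pass == 1
{ print "pass1 is", pass }' pass=1 "$tmp" pass=2 "$tmp")

printf 'fixed:\n%s\nbroken:\n%s\n' "$fixed" "$broken"

rm -f "$tmp"
```

The `broken` run should reproduce the first three lines of the output shown in the question: the record itself, then "pass1 is 1", then "pass1 is 2".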
