I am new to text preprocessing and the AWK language.

I am trying to loop through each record of a given field (field 1), find the max and min of its values, and store them in variables.

Algorithm :

1) Set Min = 0 and Max = 0

2) Loop through $1(field 1)

3) Compare FNR of the field 1 and set Max and Min

4) Finally print Max and Min

This is what I tried:

BEGIN{max = 0; min = 0; NF = 58}
{
     for(i = 0; i < NF-57; i++)
     {

           for(j =0; j < NR; j++)
           {
             min = (min < $j) ? min : $j
             max = (max > $j) ? max : $j
           }
     }
}
END{print max, min}

#Dataset
f1  f2  f3  f4 .... f58
0.3 3.3 0.5 3.6
0.9 4.7 2.5 1.6 
0.2 2.7 6.3 9.3
0.5 3.6 0.9 2.7
0.7 1.6 8.9 4.7

Here, f1,f2,..,f58 are the fields or columns in Dataset.

I need to loop through column one (f1) and find its min and max.

Output Required: Min = 0.2 Max = 0.9

What I get as a result: Min = '' (I don't get any result), Max = 9.3 (I get the max of all the fields instead of field 1)

This is for learning purposes, so I asked about one column so that I can try multiple columns on my own.

This is what I have:

This for loop would only loop 4 times, as there are only four fields. Will the code inside the for loop execute for each record, that is, 5 times?

for(i = 0; i < NF; i++)
{
    if (min[i]=="") min[i]=$i
    if (max[i]=="") max[i]=$i
    if ($i<min[i]) min[i]=$i
    if ($i>max[i]) max[i]=$i
}

END
{
    OFS="\t"; 
    print "min","max";
    #If I am not wrong, I saved the data in an array and I guess this would be the right way to print all min and max?
    for(i=0; i < NF; i++;)
    {
            print min[i], max[i]
    }
}
Murlidhar Fichadia

2 Answers

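Here is a working solution that is much simpler than what you are attempting. It sorts the input numerically first, so the first and last values kept are the min and max.

The condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ checks that $1 is a valid number; otherwise the record is discarded.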

sort -n | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {a[c++]=$1} END {OFS="\t"; print "min","max";print a[0],a[c-1]}'

If you don't sort the input first, then min and max need to be initialized, for example with the first value:

awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {if (min=="") min=$1; if (max=="") max=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {OFS="\t"; print "min","max";print min, max}'

Readable versions:

sort -n | awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
  a[c++]=$1
}
END {
  OFS="\t"
  print "min","max"
  print a[0],a[c-1]
}'

and

awk '
  $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
    if (min=="") min=$1
    if (max=="") max=$1
    if ($1<min) min=$1
    if ($1>max) max=$1
  }
  END {
    OFS="\t"
    print "min","max"
    print min, max
  }'

On your input, it outputs:

min     max
0.2     0.9

EDIT (replying to the comment requiring more information on how awk works):

Awk loops through lines (called records), and for each line the columns (called fields) are available as $1, $2, and so on. On each iteration awk reads one line and sets, among others, the variables NR (the number of records read so far) and NF (the number of fields in the current record). In your case you are only interested in the first column, so you only need $1. For each record where $1 matches /^-?[0-9]*(\.[0-9]*)?$/, a regex for positive and negative integers or floats, we either store the value in an array a (in the first version) or update the min/max variables if needed (in the second version).
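To make the record/field model concrete, here is a minimal sketch. The function name show_records is only for illustration, and the sample rows are taken from the question's dataset. It prints NR, NF and $1 for every line awk reads:

```shell
# show_records is a hypothetical helper name, used only for this illustration.
show_records() {
  # The action block runs once per record (line): NR is the record number,
  # NF the number of fields in that record, and $1 its first field.
  awk '{ printf "record %d: %d fields, first field = %s\n", NR, NF, $1 }'
}

printf '0.3 3.3 0.5 3.6\n0.9 4.7 2.5 1.6\n0.2 2.7 6.3 9.3\n' | show_records
# record 1: 4 fields, first field = 0.3
# record 2: 4 fields, first field = 0.9
# record 3: 4 fields, first field = 0.2
```

There is no explicit loop over lines: awk itself supplies that loop, which is why a column can be scanned without any for statement.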

Here is the explanation for the condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/:

  • $1 ~ means we are checking if the first field $1 matches the regex between slashes
  • ^ means we start matching from the beginning of the $1 field
  • -? means an optional minus sign
  • [0-9]* is any number of digits (including zero, so .1 or -.1 can be matched)
  • ()? means an optional group, which may be present or absent
  • \.[0-9]* if that optional group is present, it must start with a dot followed by zero or more digits (so -. or . can be matched, and because every part is optional, so can an empty field; adapt the regex if your input is not trusted)
  • $ means we are matching until the last character from the $1 field
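A quick way to see what the regex accepts is to feed it a few sample values. This is only a sketch: classify is a hypothetical helper name, and the test values are made up for illustration.

```shell
# classify is a hypothetical helper name; it labels each input line
# according to whether its first field matches the regex from the answer.
classify() {
  awk '{ print $1, ($1 ~ /^-?[0-9]*(\.[0-9]*)?$/ ? "match" : "no match") }'
}

printf '%s\n' 0.2 -3 .5 -. f1 1e3 | classify
# 0.2 match
# -3 match
# .5 match
# -. match
# f1 no match
# 1e3 no match
```

The header field f1 is rejected, which is what keeps the header line out of the min/max computation, while -. slips through as described above.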

If you wanted to loop through fields, you would have to use a for loop from 1 to NF (inclusive), like this:

echo "1 2 3 4" | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$(i); if (max=="") max=$(i); if ($(i)<min) min=$(i); if ($(i)>max) max=$(i)}} END {OFS="\t"; print "min","max";print min, max}'

(Note that, for simplicity, the input is not validated here.)

Which outputs:

min     max
1       4

If the input had more lines, awk would process them too, after the first record. For example, this input:

1 2 3 4
5 6 7 8

Outputs:

min     max
1       8

To prevent this and work only on the first line, you can either add a condition such as NR == 1 in front of the block, or add an exit statement after the for loop so that awk stops reading input after the first line.
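Both options can be sketched as follows. The function names are hypothetical, and the initialisation is folded into a single test per variable for brevity:

```shell
# Both helper names are hypothetical, chosen for this illustration.
# Variant 1: the NR == 1 condition restricts the action to the first record.
first_line_minmax_cond() {
  awk 'NR == 1 {
         for (i = 1; i <= NF; i++) {
           if (min == "" || $i < min) min = $i
           if (max == "" || $i > max) max = $i
         }
       }
       END { OFS = "\t"; print "min", "max"; print min, max }'
}

# Variant 2: exit after the first record; awk still runs the END block.
first_line_minmax_exit() {
  awk '{
         for (i = 1; i <= NF; i++) {
           if (min == "" || $i < min) min = $i
           if (max == "" || $i > max) max = $i
         }
         exit
       }
       END { OFS = "\t"; print "min", "max"; print min, max }'
}

printf '1 2 3 4\n5 6 7 8\n' | first_line_minmax_cond
printf '1 2 3 4\n5 6 7 8\n' | first_line_minmax_exit
```

With the two-line input above, both variants should report a min of 1 and a max of 4. The exit variant stops reading as soon as the first line is done, which matters mostly for large inputs.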

Camusensei
  • Thank you for the solution, but can you please explain the regex "$1 ~ /^-?[0-9]*(\.[0-9]*)?$/"? Also, could you explain how AWK works in terms of columns and rows? I don't understand how to loop through the records of a field in awk style, or how to loop through fields in awk style. I don't mind if you share a link that explains AWK's basic principles – Murlidhar Fichadia Jun 17 '16 at 12:07
  • I hope the pieces of information I added will satisfy your curiosity :) – Camusensei Jun 17 '16 at 12:29
  • I added detailed regex explanation. – Camusensei Jun 17 '16 at 12:45
  • I saw the explanation and I understood how it works. I am now trying to extend the solution to multiple columns on my own, and I have modified your piece of code to work for multiple columns too. I will update the code below, please have a look at it. – Murlidhar Fichadia Jun 17 '16 at 13:10
  • Please check my code for multiple columns and tell me what you think of it. I have commented the bits I don't understand – Murlidhar Fichadia Jun 17 '16 at 13:17
  • This should be asked in a separate question... I don't know where I should correct your code... you are making lots of syntax errors in your new code: missing `{}` around the `for` loop, `i` ranging from `0..NF-1` instead of `1..NF`, extra `;`, extra newline, unclear requirement, ... In any case, here is a fixed version: http://sprunge.us/aAdL (instead of `./temp < temp2`, I could have used other calling methods, like `awk -f temp < temp2` or `./temp temp2`...) – Camusensei Jun 17 '16 at 14:21
  • I got it working for all the columns, check my new post : http://stackoverflow.com/questions/37883643/awk-how-records-and-fields-are-executed-and-read?noredirect=1#comment63222093_37883643 – Murlidhar Fichadia Jun 17 '16 at 14:26

If you're only looking at column 1, you may try this:

awk '/^[[:digit:]].*/{if($1<min||!min){min=$1};if($1>max){max=$1}}END{print min,max}' dataset

The script looks for lines starting with a digit and updates min or max whenever the first field is smaller or larger than the value found so far; min is also set when it has not been set yet.
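One caveat: the !min test is also true when min is 0, and an unset max compares as 0, so input containing 0 or only negative values can give surprising results (the pattern above also skips lines that begin with a minus sign). A possible variant, offered only as a sketch, seeds both variables from the first numeric record. The names col1_minmax and seen, and the looser pattern /^-?[0-9]/, are assumptions made for this illustration:

```shell
# col1_minmax is a hypothetical helper name; "seen" is an illustrative flag.
col1_minmax() {
  awk '$1 ~ /^-?[0-9]/ {
         # Seed min and max from the first numeric record instead of
         # relying on !min, which is also true when min is 0.
         if (!seen) { min = max = $1; seen = 1 }
         if ($1 < min) min = $1
         if ($1 > max) max = $1
       }
       END { print min, max }'
}

printf 'f1 f2\n0 1\n-2.5 3\n-1 4\n' | col1_minmax
# -2.5 0
```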

oliv
  • The task is to find min and max of all the fields. As you can see, the expected max is 9.3 and is located in the fourth column – GMichael Jun 17 '16 at 11:57
  • @GMichael it's not what I read from the text: _Output Required: Min = 0.2 Max = 0.9_ – oliv Jun 17 '16 at 11:59
  • This works, good job. Just remove the useless `cat` because awk can do the job just fine (^^) – Camusensei Jun 17 '16 at 12:02
  • Very nice usage of `|!min`, I'll be sure to use it in the future. Of course, if the input may contain only negative numbers, you'd have to add `|!max` as well :) – Camusensei Jun 17 '16 at 12:50