Min-Max Normalization using AWK

Question

I don't know why I am unable to loop through all the records. Currently the script only processes the last record and prints the normalization for it.

Normalization formula:

New_Value = (value - min[i]) / (max[i] - min[i])
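As a worked instance of the formula, here is a minimal sketch. The helper name `min_max` and the sample numbers are made up for illustration and do not come from the dataset in the question.

```shell
min_max() {
    # $1: value, $2: column minimum, $3: column maximum (hypothetical inputs)
    awk -v v="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { printf "%.2f\n", (v - lo) / (hi - lo) }'
}
```

For a column whose minimum is 2 and maximum is 10, `min_max 4 2 10` evaluates (4 - 2) / (10 - 2) and prints 0.25.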

Program

{
    for(i = 1; i <= NF; i++)
    {
        if (min[i]==""){  min[i]=$i;}     #initialise min
        if (max[i]==""){  max[i]=$i;}     #initialise max
        if ($i<min[i]) {  min[i]=$i;}     #new min
        if ($i>max[i]) {  max[i]=$i;}     #new max
    }

}
END {
    for(j = 1; j <= NF; j++)
        {
        normalized_value[j] = ($j - min[j])/(max[j] - min[j]);
        print $j, normalized_value[j];
    }
}

Dataset

Current Output

Required Output

0.75 0.75 0.75 0.75
0.50 0.50 0.50 0.50 
0.00 0.00 0.00 0.00
0.25 0.25 0.25 0.25
1.00 1.00 1.00 1.00

`$j` in the `END` block will be value of the `j`'th field of the last record in the dataset. — jas, Jun 17 '16 at 16:52
Firstly, I don't understand your required output. Where are these numbers coming from? Min of what? Max of what? Secondly, you are referencing NF in your END clause, which really doesn't make sense. — Rumbleweed, Jun 17 '16 at 16:55
I am really new to awk and am having a hard time understanding the language, sorry for that. The min and max are calculated from the dataset I have shown: I calculated the min and max of each field (column), and then normalized each field using the min-max algorithm. — Murlidhar Fichadia, Jun 17 '16 at 17:04
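
The point in the first comment can be checked with a small sketch. The function name `last_record_in_end` and the three input values are invented for illustration. It assumes an awk that keeps the last record available in `END`, as gawk, mawk, and the BWK awk do; some older implementations may not.

```shell
last_record_in_end() {
    # Three one-field records go in; END runs once, after all of them,
    # so $1 there is the first field of the final record only.
    printf '10\n20\n30\n' | awk 'END { print $1 }'
}
```

Calling `last_record_in_end` prints only the last value, which is why the `END` loop in the question normalizes a single row. Anything needed from earlier rows has to be saved in an array or recomputed in a second pass.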

glenn jackman · Accepted Answer · 2016-06-17T18:19:39.617

5

I would process the file twice, once to determine the minima/maxima, once to calculate the normalized values:

awk '
    NR==1 {
        for (i=1; i<=NF; i++) {
            min[i]=$i
            max[i]=$i
        }
        next
    }
    NR==FNR {
        for (i=1; i<=NF; i++) {
            if      ($i < min[i]) {min[i]=$i}
            else if ($i > max[i]) {max[i]=$i}
        }
        next
    }
    {
        for (i=1; i<=NF; i++) printf "%.2f%s", ($i-min[i])/(max[i]-min[i]), FS
        print ""
    }
' file file
# ^^^^ ^^^^  same file twice!

outputs

0.75 0.75 0.75 0.75 
0.50 0.50 0.50 0.50 
0.00 0.00 0.00 0.00 
0.25 0.25 0.25 0.25 
1.00 1.00 1.00 1.00
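
The `NR==FNR` test is what separates the two passes: `NR` counts records across all input files, while `FNR` restarts at 1 for each file, so the two are equal only while the first copy is being read. A minimal sketch to make that visible (the function name `show_passes` is hypothetical):

```shell
show_passes() {
    # $1: any text file; it is passed to awk twice, as in the answer above
    awk '{ print (NR == FNR ? "pass1" : "pass2"), FNR, $0 }' "$1" "$1"
}
```

One caveat: if the file is empty, no record is ever read, so neither pass does any work and nothing is printed.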

edited Jun 17 '16 at 18:19

answered Jun 17 '16 at 16:51

glenn jackman


Thanks, I understood finally. – Murlidhar Fichadia Jun 17 '16 at 18:08
Can you have a look at another question of mine on awk? http://stackoverflow.com/questions/37897154/one-nearest-neighbor-using-awk?noredirect=1#comment63248037_37897154 I need help with it. – Murlidhar Fichadia Jun 18 '16 at 13:30
This is very useful. My input data contains a header and also a first column with ids. How can I include this info in my output using this script? – Alex May 08 '20 at 20:11
Assuming the header is the first line, add `FNR == 1 {next}` as the first line of the awk code. For the first column, consider how you have to adapt the for loops. – glenn jackman May 08 '20 at 22:01
I am using the suggested script to normalize my data, but I get this error message: awk: cmd. line:17: (FILENAME=input.txt FNR=1) fatal: division by zero attempted. How can I solve it? – Alex Jul 14 '20 at 13:53
Did you adjust the `NR==1` condition if you have a header row? – glenn jackman Jul 14 '20 at 15:20
I removed the header. This error happens because I have one column with only 0 values. – Alex Jul 14 '20 at 21:12
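
Pulling the comments above together, here is one possible sketch of the two-pass approach adapted for a header row, an id in the first column, and columns whose values are all equal. It is an illustration, not the answerer's code. The function name `normalize_with_header` is hypothetical, and printing 0.00 for a constant column is an assumption; another placeholder may suit your data better.

```shell
normalize_with_header() {
    # $1: whitespace-separated file with a header row and an id in column 1
    awk '
        FNR == 1 {                  # header row: echo it on the second pass only
            if (NR != FNR) print
            next
        }
        NR == FNR {                 # first pass: per-column min/max, skipping the id
            for (i = 2; i <= NF; i++) {
                if (!(i in min) || $i + 0 < min[i]) min[i] = $i + 0
                if (!(i in max) || $i + 0 > max[i]) max[i] = $i + 0
            }
            next
        }
        {                           # second pass: id first, then normalized fields
            printf "%s", $1
            for (i = 2; i <= NF; i++) {
                range = max[i] - min[i]
                # constant column: range is 0, so print 0.00 instead of dividing
                printf "%s%.2f", OFS, (range == 0 ? 0 : ($i - min[i]) / range)
            }
            print ""
        }
    ' "$1" "$1"
}
```

Run it as `normalize_with_header input.txt`. The header line and the ids pass through unchanged, and joining fields with a leading `OFS` also avoids the trailing separator at the end of each row.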

score 0 · Answer 2 · answered Jan 14 '21 at 10:01

The accepted answer reads the same file twice. This can be avoided by storing the values in an array while reading, as in the following modified script:

# initialization of min, max and the value array to be used later
NR == 1 {
    for (i=1; i<=NF; i++) {
        value[i] = $i
        min[i] = $i
        max[i] = $i
    }
}
# storing each value and finding min and max for each column
NR > 1 {
    for (i=1; i<=NF; i++) {
        value[((NR-1)*NF)+i] = $i
        if      ($i < min[i])    {min[i] = $i}
        else if ($i > max[i])    {max[i] = $i}
    }
}
END {
    ncolumns = NF    # fields per row; assumes every row has the same count
    nrows = NR       # total number of rows read
    for (i=0; i<nrows; i++ ) {
        for (j=1; j<ncolumns; j++ ) {
            printf "%.2f%s", (value[(i*ncolumns)+j]-min[j])/(max[j]-min[j]), OFS
        }
        # after the inner loop j == ncolumns, so this prints the last field
        printf "%.2f\n", (value[(i*ncolumns)+j]-min[j])/(max[j]-min[j])
    }
}

Save the above awk script as norm.awk. You can run it from the shell (and redirect the output if needed) as:

awk -f norm.awk data.txt > norm_output.txt

Alternatively, you can run norm.awk from within vim as:

:%!awk -f norm.awk

This replaces the existing values in the buffer with their min-max normalized values.