reset NR in awk

Question

cat file.txt

MNS GYPA*N  
MNS GYPA*M  c.59T>C;c.71A>G;c.72G>T
MNS GYPA*Mc c.71G>A;c.72T>G
MNS GYPA*Vw c.140C>T
MNS GYPA*Mg c.68C>A
MNS GYPA*Vr c.197C>A
MNS GYPB*Mta    c.230C>T
MNS GYPB*Ria    c.226G>A
MNS GYPB*Nya    c.138T>A
MNS GYPA*Hut    c.140C>A
.
.
.

The second-column values can start with GYPA, GYPB, GYPC, GYPD, ... GYPZ. I would like to number the entries within each GYP* group, restarting at 1 for every new group, and split the third column as follows:

1   MNS  GYPA*N
2   MNS GYPA*M  c.59T>C
2   MNS GYPA*M  c.71A>G
2   MNS GYPA*M  c.72G>T
3   MNS GYPA*Mc c.71G>A
3   MNS GYPA*Mc c.72T>G
4   MNS GYPA*Vw c.140C>T
5   MNS GYPA*Mg c.68C>A
6   MNS GYPA*Vr c.197C>A
1   MNS GYPB*Mta    c.230C>T
2   MNS GYPB*Ria    c.226G>A
3   MNS GYPB*Nya    c.138T>A
4   MNS GYPB*Hut    c.140C>A
.
.
.

cat format.awk

BEGIN {FS=OFS="\t"}

$2 ~ /GYPA/
   { num=split($3,arr,/;/);
      for (i=1;i<=num;i++)
         { print NR,$1,$2,arr[i]}}

$2 ~ /GYPB/
   { num=split($3,arr,/;/);
      for (i=1;i<=num;i++)
         { print NR,$1,$2,arr[i]} }
...

I am not sure how to reset NR when it reaches the next GYP group. The GYP{A..Z} are in order from A to Z.

score 3 · Accepted Answer · edited Mar 04 '20 at 01:58

awk '
{
  match($2,/[^*]*/)
  gy_value=substr($2,RSTART,RLENGTH)
}
gy_value!=prev_gy_value{
  count=0
}
!arr[$2]++{
  count++
}
{
  num=split($3,array,";")
  for(i=1;i<=num;i++){
    print count,$1,$2,array[i]
  }
}
NF<3{ print count,$1,$2 }
{
  prev_gy_value=gy_value
}
' file.txt

Explanation: a detailed, commented version of the code above.

awk '                                   ##Start of the awk program.
{
  match($2,/[^*]*/)                     ##Match everything before the first * in the 2nd field.
  gy_value=substr($2,RSTART,RLENGTH)    ##Store the matched prefix (e.g. GYPA) in gy_value.
}
gy_value!=prev_gy_value{                ##If the prefix differs from the one on the previous line:
  count=0                               ##reset count to 0.
}
!arr[$2]++{                             ##If this 2nd field has not been seen before:
  count++                               ##increase count by 1.
}
{
  num=split($3,array,";")               ##Split the 3rd field on ; into array; num holds the number of pieces.
  for(i=1;i<=num;i++){                  ##Loop from i=1 up to num.
    print count,$1,$2,array[i]          ##Print count, $1, $2 and the i-th piece.
  }
}
NF<3{ print count,$1,$2 }               ##If the line has no 3rd field, print count, $1 and $2 only.
{
  prev_gy_value=gy_value                ##Remember the current prefix for the comparison on the next line.
}
'  Input_file                           ##Name of the input file.
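
To see what the match()/substr() pair extracts on its own, here is a minimal sketch run on three allele names taken from the question. The shell variable out is only there for illustration and is not part of the answer.

```shell
# Extract the gene prefix (the text before "*") from a few sample allele names.
out=$(printf '%s\n' 'GYPA*N' 'GYPA*Mc' 'GYPB*Mta' |
  awk '{
    match($1, /[^*]*/)                  # longest leading run of characters that are not "*"
    print substr($1, RSTART, RLENGTH)   # RSTART and RLENGTH are set by match()
  }')
printf '%s\n' "$out"
# prints:
# GYPA
# GYPA
# GYPB
```

Because the prefix changes from GYPA to GYPB on the third name, that is the point where the full program resets count.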

@user171558, it checks whether the 2nd field is NOT already present in array arr, and only then goes inside {..} to increase the value of count. Please give me a few minutes and I will add a detailed explanation to my solution too. — RavinderSingh13, Mar 03 '20 at 05:58
@user171558, Detailed explanation is added now, please go through it. — RavinderSingh13, Mar 03 '20 at 06:37
By redefining `$1` you recalculate `$0` and all fields. This might have unwanted effects. It might be better, just to say `print count,$1,$2,array[i]`, this avoids this conundrum. — kvantour, Mar 03 '20 at 14:28

kvantour · Answer 2 · 2020-03-03T14:36:46.213

I am not sure how to reset NR when it reaches the next GYP group. The GYP{A..Z} are in order from A to Z.

Built-in variables such as NR, FNR or NF are maintained by awk itself, so it is better not to repurpose them. The easiest approach is to keep track of an alternative NR in a variable of your own, which you can name c or anything else. This value can be reset under any condition to any value you want.

Example: Have a counter that resets to one every time it sees foo in a record:

awk '{c++}($0 ~ /foo/){c=1}{print c,$0}'
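
A quick way to check the behaviour is to print NR next to the custom counter. This is only a sketch: the five input lines and the shell variable out are made up for illustration.

```shell
# NR keeps counting all records, while c restarts at 1 on every "foo" line.
out=$(printf '%s\n' 'foo' 'a' 'b' 'foo' 'c' |
  awk '{c++} ($0 ~ /foo/){c=1} {print NR, c, $0}')
printf '%s\n' "$out"
# prints:
# 1 1 foo
# 2 2 a
# 3 3 b
# 4 1 foo
# 5 2 c
```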

In case of the OP, something like this might be used:

awk 'BEGIN{FS=OFS="\t"}
     {c++; key=substr($2,1,index($2,"*")-1)}
     (key != key_prev) { c=1 }
     { prefix="" }
     (key == "GYPA") { prefix="NM_002099.7:"}
     { num=split($3,a,";"); for(i=1;i<=num;++i) print c,$1,$2,prefix a[i] }
     { key_prev=key }' file
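
One thing to watch for: when the third column is empty, split() returns 0, so a loop-only print emits nothing for that row, while the expected output in the question still lists GYPA*N. A possible variant that also prints such rows is sketched below. It assumes tab-separated input with one line per allele, leaves out the prefix handling, and uses three sample rows plus a shell variable out that are purely illustrative.

```shell
# Per-group counter that also prints rows whose third column is empty.
out=$(printf 'MNS\tGYPA*N\t\nMNS\tGYPA*M\tc.59T>C;c.71A>G\nMNS\tGYPB*Mta\tc.230C>T\n' |
  awk 'BEGIN { FS = OFS = "\t" }
  {
    key = substr($2, 1, index($2, "*") - 1)   # gene prefix, e.g. GYPA
    c = (key == key_prev) ? c + 1 : 1         # restart the counter on a new prefix
    key_prev = key
    num = split($3, a, ";")                   # num is 0 when $3 is empty
    if (num == 0) print c, $1, $2             # no variants: still emit the row
    for (i = 1; i <= num; i++) print c, $1, $2, a[i]
  }')
printf '%s\n' "$out"
```

With these three rows the counter column reads 1, 2, 2, 1: one line for GYPA*N, two for GYPA*M, and a restart at 1 for GYPB*Mta.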
