
I am having trouble parsing a GFF file. I am using the code below as a one-liner. I get output filtered on column 1 ($1), but when I add the additional filter on column 4 ($4) of at least 50000 and at most 150000, awk does not filter my file as expected. I am misunderstanding something and I am not quite sure what it is.

awk '{ $1 = "s10"; 
$4 >= 50000 && $4 <=150000; 
print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6""\t"$7"\t"$8"\t"$9}' infile > outfile 

input

S03       GeneWise        mRNA    7000       84000     40.00   -       .       ID=NA;Source=NA;Function="NA";
S07       GeneWise        CDS     80450       96070     .       -       0       Parent=NA;
S10       GeneWise        mRNA    96000       105032     50.00   -       .       ID=NA;Source=NA;Function="NA";
S10       GeneWise        CDS     43800       76000     .       -       0       Parent=NA;
S10      GeneWise        mRNA    175032       190540     41.11   +       .       ID=NA;Source=NA;Function="NA";
S11       GeneWise        CDS     3700       15000     .       +       0       Parent=NA;
S15       GeneWise        mRNA    18055       25000     40.00   -       .       ID=S15;Source=NA;Function="NA";

output I am obtaining (showing the error)

S10       GeneWise        mRNA    96000       105032     50.00   -       .       ID=NA;Source=NA;Function="NA";
S10       GeneWise        CDS     43800       76000     .       -       0       Parent=NA;
S10      GeneWise        mRNA    175032       190540     41.11   +       .       ID=NA;Source=NA;Function="NA";

expected output

S10       GeneWise        mRNA    96000       105032     50.00   -       .       ID=NA;Source=NA;Function="NA";
serious
  • Without seeing the sample of input and sample output it is very difficult to say anything, please add samples in your post in code tags. – RavinderSingh13 May 22 '18 at 02:40
  • What is `{ $4 >= 50000 && $4 <=150000; }` supposed to do? Do you mean `if($4 >= 50000 && $4 <=150000) print ...` instead? – James Brown May 22 '18 at 03:05
  • Would using `if($4 >= 50000 && $4 <=150000) print` solve my issue? I do mean to use a conditional statement. – serious May 22 '18 at 03:11
  • ... or maybe `if($1=="s10" && $4 >= 50000 && $4 <=150000) print ...` and in that case are you sure the second output record should be there, since `$4==43800`? – James Brown May 22 '18 at 03:12
  • @JamesBrown I attempted to use your conditional statement and it didn't filter based on $4. I tried to use `if` previously. – serious May 22 '18 at 03:16
  • I uploaded a very small subset of my data that included some of the $4 that would fit this statement and others that wouldn't. – serious May 22 '18 at 03:18
  • The output I am receiving includes the error I am having. – serious May 22 '18 at 03:32
  • I don't see an error posted with the question or even a mention of it. Please, post all the information you have on the subject, don't make us ask for it piece by piece. – James Brown May 22 '18 at 03:50
  • I have edited the output I was receiving with the error versus desired. Apologies for not making it clearer. I have approved your first one liner. I was trying something similar to it but it appears I was misplacing the { }. Thanks! – serious May 22 '18 at 04:52
  • It's not possible for the script you posted to produce the output you say it does. When asking for help fixing a script it's important to show us the actual script you need help with rather than some other script. Doing otherwise is like asking a mechanic for help fixing your car and showing her your horse. – Ed Morton May 22 '18 at 12:01

1 Answer

This is the correct form for the conditional. However, there is only one matching record for it:

$ awk ' 
$1 == "S10" && $4 >= 50000 && $4 <=150000 { 
    print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9
}' file
S10     GeneWise        mRNA    96000   105032  50.00   -       .       ID=NA;Source=NA;Function="NA";
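
To illustrate why the one-liner in the question does not filter, here is a minimal sketch on a hypothetical two-line sample (the `sample`, `broken` and `fixed` variable names are made up for this example, and only two of the nine columns are printed). Inside the braces, `=` assigns rather than compares, and a bare comparison is evaluated and then thrown away, so every line gets printed. Moving the condition in front of the braces makes it a pattern that gates the action:

```shell
# Hypothetical two-line sample, whitespace-separated like the GFF excerpt.
sample='S10 GeneWise mRNA 96000 105032
S11 GeneWise CDS 3700 15000'

# Shape of the original attempt: "$1 = ..." overwrites field 1 on every line,
# and the comparison is a statement whose result is discarded.
broken=$(printf '%s\n' "$sample" |
  awk '{ $1 = "s10"; $4 >= 50000 && $4 <= 150000; print $1, $4 }')

# Pattern form: the condition sits outside the braces, so the action
# only runs for lines where it is true.
fixed=$(printf '%s\n' "$sample" |
  awk '$1 == "S10" && $4 >= 50000 && $4 <= 150000 { print $1, $4 }')

printf 'broken:\n%s\nfixed:\n%s\n' "$broken" "$fixed"
```

The `broken` variant prints both lines with `$1` rewritten to `s10`, while `fixed` prints only the S10 line whose `$4` is in range. Note also that awk string comparison is case-sensitive, so `"s10"` would not match `S10` in the data.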

unless you want records matching `$1 == "S10" || $4 >= 50000 && $4 <=150000` (i.e. using logical OR), but that would bring in one extra record (the S07 line) compared with the output you posted:

$ awk '
$1 == "S10" || $4 >= 50000 && $4 <=150000 { 
    print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9
}' file
S07     GeneWise        CDS     80450   96070   .       -       0       Parent=NA;
S10     GeneWise        mRNA    96000   105032  50.00   -       .       ID=NA;Source=NA;Function="NA";
S10     GeneWise        CDS     43800   76000   .       -       0       Parent=NA;
S10     GeneWise        mRNA    175032  190540  41.11   +       .       ID=NA;Source=NA;Function="NA";
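
A note on how that OR condition is grouped: in awk, `&&` binds more tightly than `||`, so the pattern reads as `$1 == "S10" || ($4 >= 50000 && $4 <= 150000)`. A small sketch with constant values (the variable names `a`, `b` and `result` are only for illustration) shows the difference between the default grouping and an explicitly parenthesized one:

```shell
# a uses awk's default precedence: 1 || (0 && 0)
# b forces the other grouping:     (1 || 0) && 0
result=$(awk 'BEGIN {
    a = 1 || 0 && 0
    b = (1 || 0) && 0
    print a, b
}')
echo "$result"
```

If the intent were "S10 or in-range start, and also some other test", explicit parentheses would be needed to get that grouping.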

Better form of the first:

$ awk ' 
BEGIN { OFS="\t" }                           # define OFS to \t
$1 == "S10" && $4 >= 50000 && $4 <=150000 { 
    $1=$1                                    # rebuild the record
    print                                    # output
}' file
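
As a rough illustration of why the `$1=$1` line is there: setting `OFS` alone does not change what a plain `print` outputs, because `$0` is only rejoined with `OFS` once a field is assigned. The sketch below uses a hypothetical single line and made-up variable names (`line`, `without`, `with`):

```shell
# Hypothetical input line with runs of spaces between fields.
line='S10   GeneWise    mRNA   96000'

# OFS is set, but no field is touched: $0 is printed unchanged.
without=$(printf '%s\n' "$line" | awk 'BEGIN { OFS="\t" } { print }')

# Assigning a field (even to itself) rebuilds $0 using OFS.
with=$(printf '%s\n' "$line" | awk 'BEGIN { OFS="\t" } { $1 = $1; print }')

printf '%s\n%s\n' "$without" "$with"
```

The first result keeps the original spacing; the second has single tabs between the fields.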
James Brown