
I want to extract information from a large file based on multiple conditions (on that same file) as well as a pattern lookup against a second, smaller file. This is the script I use:

awk 'BEGIN{FS=OFS="\t"}NR==FNR{a[$0]++;next}$1 in a {print $2,$4,$5}' file2.txt file1.txt >output.txt

Now I want to add a condition to the same awk script so that it ONLY prints lines where the 4th column (a single character among A, T, G, C) matches the 5th column (also a single character among A, T, G, C); both columns are in file1.txt.

In other words, I want to merge the following script with the one above:

awk '$4 " "==$5{print $2,$4,$5}' file1.txt

Following is the representation of file1.txt:

SNP Name    Sample ID   GC Score    Allele1 - Forward   Allele2 - Forward
ARS-BFGL-BAC-10172  834269752   0.9374  A   G
ARS-BFGL-BAC-1020   834269752   0.9568  A   A
ARS-BFGL-BAC-10245  834269752   0.7996  C   C
ARS-BFGL-BAC-10345  834269752   0.9604  A   C
ARS-BFGL-BAC-10365  834269752   0.5296  G   G
ARS-BFGL-BAC-10591  834269752   0.4384  A   A
ARS-BFGL-BAC-10793  834269752   0.9549  C   C
ARS-BFGL-BAC-10867  834269752   0.9400  G   G
ARS-BFGL-BAC-10951  834269752   0.5453  T   T



Following is the representation of file2.txt:

    ARS-BFGL-BAC-10172
    ARS-BFGL-BAC-1020
    ARS-BFGL-BAC-10245
    ARS-BFGL-BAC-10345
    ARS-BFGL-BAC-10365
    ARS-BFGL-BAC-10591
    ARS-BFGL-BAC-10793
    ARS-BFGL-BAC-10867
    ARS-BFGL-BAC-10951

Output should be:

834269752   A   A
834269752   C   C
834269752   G   G
834269752   A   A
834269752   C   C
834269752   G   G
834269752   T   T
Maulik Upadhyay

1 Answer


You can simply combine the two conditions with boolean logic (`&&`). From your input file it also seems you can get away with awk's default whitespace field splitting instead of `FS="\t"`, which lets you drop that space from the comparison:

awk 'BEGIN{OFS="\t"}
     NR==FNR{a[$0]++;next}
     ($1 in a) && ($4==$5) {print $2,$4,$5}' file2.txt file1.txt > output.txt

As an example, here is my test file2.txt:

ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10172

And here is the result of the command above:

834269752   A   A
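
To try the combined command on a small scale, a sketch along the following lines may help. It builds two tiny tab-separated samples taken from the rows in the question and runs the same filter on them. The file names `sample1.txt`, `sample2.txt`, and `sample_out.txt` are hypothetical and not part of the original question.

```shell
# Build small tab-separated samples (hypothetical file names, rows copied from the question).
printf 'SNP Name\tSample ID\tGC Score\tAllele1 - Forward\tAllele2 - Forward\n' > sample1.txt
printf 'ARS-BFGL-BAC-10172\t834269752\t0.9374\tA\tG\n' >> sample1.txt
printf 'ARS-BFGL-BAC-1020\t834269752\t0.9568\tA\tA\n' >> sample1.txt
printf 'ARS-BFGL-BAC-10245\t834269752\t0.7996\tC\tC\n' >> sample1.txt

# Lookup list: only two of the three SNP names.
printf 'ARS-BFGL-BAC-1020\nARS-BFGL-BAC-10172\n' > sample2.txt

# First file fills the lookup array; second file is filtered on both conditions.
awk 'BEGIN{OFS="\t"}
     NR==FNR{a[$0]++;next}
     ($1 in a) && ($4==$5) {print $2,$4,$5}' sample2.txt sample1.txt > sample_out.txt

cat sample_out.txt
```

Only the ARS-BFGL-BAC-1020 row passes both tests: ARS-BFGL-BAC-10172 is in the list but has different alleles, and ARS-BFGL-BAC-10245 has matching alleles but is not in the list.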
chthonicdaemon
  • I already tried boolean conditions in every possible way, but it's not working – Maulik Upadhyay Nov 20 '14 at 15:12
  • What do you mean when you say "it's not working"? Do you get an error, or do you get unexpected output? – chthonicdaemon Nov 20 '14 at 15:14
  • Sorry, I should have mentioned that it gives me a blank file. If I run the two commands separately it works, but only after copying all the content from the output file of the first command into a new file and then running the second command on that – Maulik Upadhyay Nov 20 '14 at 15:17
  • It would be very useful if you edited your question to show exactly what your working workflow is, including the output of `head file1.txt` and `head file2.txt`. If you use the output from the first command as the input of the second, I don't understand how you still have enough columns. – chthonicdaemon Nov 20 '14 at 15:21
  • I cannot upload a screenshot of the file as my reputation is less than 10 (this is my first question), and I am not able to post the result of "head file1.txt" properly on this forum (when I paste the file, its tab-delimited layout gets mangled). Can you please help me? – Maulik Upadhyay Nov 20 '14 at 15:27
  • Just copy and paste the result of `head`, then select it and press Control-k, which will indent it four spaces and make it appear as code. – chthonicdaemon Nov 20 '14 at 15:29
  • It appears that the problem may have been that space in the comparison. If you just use "normal" whitespace splitting, the comparison becomes easier, and then it works for me. – chthonicdaemon Nov 20 '14 at 15:43
  • For me it only worked after I copied all the content from the large file (5.9 GB) into a new file (strangely, the new file is only 5.7 GB). Thank you very much for the help – Maulik Upadhyay Nov 21 '14 at 14:22
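
The thread never establishes why copying the file made the command work. One possibility, offered here purely as an assumption, is that the original file had Windows-style (CRLF) line endings. In that case `$5` would be `A` followed by a carriage return and would never equal `$4`. The slightly smaller size of the copy would be consistent with that, but does not prove it. A minimal sketch that tolerates such input by stripping a trailing carriage return before comparing (file names `crlf1.txt`, `crlf2.txt`, and `crlf_out.txt` are hypothetical):

```shell
# Hypothetical samples with CRLF line endings, to mimic the suspected problem.
printf 'ARS-BFGL-BAC-1020\t834269752\t0.9568\tA\tA\r\n' > crlf1.txt
printf 'ARS-BFGL-BAC-10172\t834269752\t0.9374\tA\tG\r\n' >> crlf1.txt
printf 'ARS-BFGL-BAC-1020\r\nARS-BFGL-BAC-10172\r\n' > crlf2.txt

awk 'BEGIN{OFS="\t"}
     {sub(/\r$/, "")}       # drop a trailing carriage return, if present (fields are re-split)
     NR==FNR{a[$1]; next}   # key on $1 rather than $0 so surrounding blanks are ignored
     ($1 in a) && ($4==$5) {print $2,$4,$5}' crlf2.txt crlf1.txt > crlf_out.txt

cat crlf_out.txt
```

Keying the lookup on `$1` instead of `$0` is a small variation on the answer's command: with default field splitting it also ignores leading or trailing blanks in the list file.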