A similar question to mine was asked previously. However, mine is a bit more complex, since my file has many columns and I have to apply the condition from a specific column through to the last one.
I have a file in TSV format that has unique rows like this:
chr1 28812384 chr1_28812384_A_T A T 41.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0:25:25,0:50:0,75,749:.. 0/0:17:17,0:50:0,51,509:.. 0/0:16:16,0:48:0,48,479:.. 0/0:21:21,0:50:0,63,629:..
chr1 28812385 chr1_28812385_G_C G C 52.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC ./.:22:22,0:50:0,66,659:.. 0/1:16:16,0:48:0,48,479:.. 0/0:25:25,0:50:0,75,749:.. 0/0:16:16,0:48:0,48,479:..
chr1 28812386 chr1_28812386_T_A T A 16.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0:27:27,0:50:0,81,809:.. 0/0:24:24,0:48:0,48,719:.. 0/0:32:32,0:50:0,96,959:.. 0/0:19:19,0:50:0,57,569:.
chr1 28812387 chr1_28812387_T_C T C 7.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0:41:41,0:20:0,35,249:.. 0/0:10:10,0:40:0,13,29:.. 0/0:16:16,0:48:0,48,149:.. 1/1:1:1,0:50:0,23,229:..
The real file has 21 rows but 200000 columns (the example above is truncated).
My condition:
Starting from column 10, I wish to print a column only if the unique values of its first subfield (subfields are separated by `:`) contain at least one `1/1` or `0/1`.
Example:
Given the example above, applying the condition from column 10 onward, the first subfields are
chr1 28812384 chr1_28812384_A_T A T 41.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0 0/0 0/0 0/0
chr1 28812385 chr1_28812385_G_C G C 52.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC ./. 0/1 0/0 0/0
chr1 28812386 chr1_28812386_T_A T A 16.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0 0/0 0/0 0/0
chr1 28812387 chr1_28812387_T_C T C 7.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0 0/0 0/0 1/1
and the unique first-subfield values per column are

column 10: 0/0 ./.
column 11: 0/0 0/1
column 12: 0/0
column 13: 0/0 1/1

So, the columns to be printed are columns 11 and 13, where `0/1` or `1/1` exists. The expected output is the following:
chr1 28812384 chr1_28812384_A_T A T 41.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0:17:17,0:50:0,51,509:.. 0/0:21:21,0:50:0,63,629:..
chr1 28812385 chr1_28812385_G_C G C 52.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/1:16:16,0:48:0,48,479:.. 0/0:16:16,0:48:0,48,479:..
chr1 28812386 chr1_28812386_T_A T A 16.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0:24:24,0:48:0,48,719:.. 0/0:19:19,0:50:0,57,569:.
chr1 28812387 chr1_28812387_T_C T C 7.0 . AF=5e-06;AQ=41 GT:DP:AD:GQ:PL:RNC 0/0:10:10,0:40:0,13,29:.. 1/1:1:1,0:50:0,23,229:..
My approach
My approach was to use a mix of standard Linux tools in multiple steps, as follows:
for i in {10..200000}; do echo $i $(awk -v i=$i -F $'\t' '{split($i,a,":"); print a[1]}' file.vcf | sort | uniq | tr '\n' ' '); done
With this method, I get the unique first-subfield values per column index. Then, I need to `grep` for `0/1` and `1/1` and use the matching indexes to `cut` the columns from the actual file.
I am pretty sure there is a nicer approach that does it all in one go with `awk`. Any help would be appreciated.
Thanks in advance.
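Edit: for what it's worth, here is a rough two-pass awk sketch I put together. It reads the file twice: the first pass marks every column from 10 onward whose first subfield is `0/1` or `1/1` in at least one row, and the second pass prints columns 1-9 plus only the marked columns. I have not verified it against the real 200000-column file:

```shell
# Pass 1 (NR==FNR): remember columns whose genotype (first :-separated
# subfield) is 0/1 or 1/1 in any row. Pass 2: print cols 1-9 + kept cols.
awk -F'\t' -v OFS='\t' '
NR == FNR {
    for (i = 10; i <= NF; i++) {
        split($i, a, ":")
        if (a[1] == "0/1" || a[1] == "1/1") keep[i] = 1
    }
    next
}
{
    line = $1
    for (i = 2; i <= NF; i++)
        if (i <= 9 || i in keep) line = line OFS $i
    print line
}' file.vcf file.vcf
```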