0

A quite similar question was previously asked that resembled mine. However, mine is a bit more complex since my file has many columns and I have to split starting from a specific column till the last.

I have a file in TSV format that has unique rows like this:

chr1    28812384        chr1_28812384_A_T       A       T       41.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:25:25,0:50:0,75,749:..      0/0:17:17,0:50:0,51,509:..   0/0:16:16,0:48:0,48,479:..       0/0:21:21,0:50:0,63,629:..
chr1    28812385        chr1_28812385_G_C       G       C       52.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      ./.:22:22,0:50:0,66,659:..      0/1:16:16,0:48:0,48,479:..   0/0:25:25,0:50:0,75,749:..       0/0:16:16,0:48:0,48,479:..
chr1    28812386        chr1_28812386_T_A       T       A       16.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:27:27,0:50:0,81,809:..      0/0:24:24,0:48:0,48,719:..  0/0:32:32,0:50:0,96,959:..       0/0:19:19,0:50:0,57,569:.
chr1    28812387        chr1_28812387_T_C       T       C       7.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:41:41,0:20:0,35,249:..      0/0:10:10,0:40:0,13,29:..   0/0:16:16,0:48:0,48,149:..       1/1:1:1,0:50:0,23,229:..

It has 21 rows but 200000 columns.

My condition:

Starting from column 10, I wish to print the column if the unique value of the first field (separated by :) contains at least one 1/1 or 0/1.

Example:

Given the example above, applying the condition from column 10 onward, the first fields would be

chr1    28812384        chr1_28812384_A_T       A       T       41.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0      0/0   0/0       0/0
chr1    28812385        chr1_28812385_G_C       G       C       52.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      ./.      0/1   0/0       0/0
chr1    28812386        chr1_28812386_T_A       T       A       16.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0      0/0  0/0       0/0
chr1    28812387        chr1_28812387_T_C       T       C       7.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0      0/0   0/0       1/1

and the unique values of these columns are

0/0       0/0       0/0       0/0
./.       0/1                 1/1

So, the columns to be printed are where 0/1 and 1/1 exists. The expected output is the following:

chr1    28812384        chr1_28812384_A_T       A       T       41.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:17:17,0:50:0,51,509:..       0/0:21:21,0:50:0,63,629:..
chr1    28812385        chr1_28812385_G_C       G       C       52.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/1:16:16,0:48:0,48,479:..       0/0:16:16,0:48:0,48,479:..
chr1    28812386        chr1_28812386_T_A       T       A       16.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:24:24,0:48:0,48,719:..       0/0:19:19,0:50:0,57,569:.
chr1    28812387        chr1_28812387_T_C       T       C       7.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:10:10,0:40:0,13,29:..       1/1:1:1,0:50:0,23,229:..

My approach

My approach was using a mix of programs from Linux in multiple steps as following:

for i in {10..200000}; do echo $i $(awk -v i=$i -F $'\t' '{split($i,a,":"); print a[1]}' file.vcf | sort | uniq | tr '\n' ' '); done

With this method, I get unique values of column indexes. Then, I need to grep the 0/1 and 1/1 to cut the indexes from the actual file.

I am pretty sure there could be a nicer approach to do it in one go with awk. Any help would be appreciated.

Thanks in advance.

user324810
  • 597
  • 8
  • 20
  • Regarding `So, the columns to be printed are where 0/1 and 1/1 exists` you mean `0/1` OR `1/1` could come anywhere in between values of fields too? Kindly confirm on same. – RavinderSingh13 May 27 '21 at 12:16
  • sorry for the confusion, I meant `OR`. if either `1/1` or `0/1` is found in the column, then it should be printed. This information is always found in the first field of each column starting from column 10. – user324810 May 27 '21 at 12:21
  • Ok, so you want to check column wise values, if 1/1 or 0/1 are present in any of the values then whole column should be printed? – RavinderSingh13 May 27 '21 at 12:57
  • Yes, that's exactly what I am trying to achieve. – user324810 May 27 '21 at 13:04

2 Answers2

2

The problem would be simpler if the data were transposed.

GNU datamash provides such functionality:

datamash transpose <in.tsv |\
sed '10,$ { /\(^\|\t\)[01]\/1:/ ! d }' |\
datamash transpose >out.tsv
jhnc
  • 11,310
  • 1
  • 9
  • 26
  • btw, I used `\(^\|\t\)` instead of the neater `\<` because I'm not certain `\<` (or `\b`) is [posix](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html) – jhnc May 27 '21 at 15:00
  • actually, posix sed doesn't understand alternation (`\|`) either. The equivalent awk is: `...| awk '/(^|\t)[01]\/1:/ || NR<10' | ...` – jhnc May 27 '23 at 03:16
2
$ cat tst.awk
BEGIN {
    beg = 10
    for (i=1; i<beg; i++ ) {
        cols[++numCols] = i
    }
    split("0/1 1/1",tmp)
    for (i in tmp) {
        good[tmp[i]]
    }
    FS = OFS = "\t"
}
NR==FNR {
    for (i=beg; i<=NF; i++) {
        sub(/:.*/,"",$i)
        if ( $i in good ) {
            cols[++numCols] = i
        }
    }
    next
}
{
    for (i=1; i<=numCols; i++) {
        printf "%s%s", $(cols[i]), (i<numCols ? OFS : ORS)
    }
}

$ awk -f tst.awk file file
chr1    28812384        chr1_28812384_A_T       A       T       41.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:17:17,0:50:0,51,509:..    0/0:21:21,0:50:0,63,629:..
chr1    28812385        chr1_28812385_G_C       G       C       52.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/1:16:16,0:48:0,48,479:..    0/0:16:16,0:48:0,48,479:..
chr1    28812386        chr1_28812386_T_A       T       A       16.0    .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:24:24,0:48:0,48,719:..    0/0:19:19,0:50:0,57,569:.
chr1    28812387        chr1_28812387_T_C       T       C       7.0     .       AF=5e-06;AQ=41  GT:DP:AD:GQ:PL:RNC      0/0:10:10,0:40:0,13,29:..     1/1:1:1,0:50:0,23,229:..
Ed Morton
  • 188,023
  • 17
  • 78
  • 185