0

I have a coding problem beyond my limited skills with unix power tools. I'm looking to count the number of sample with either: i) a homozygous variant in a gene (BB below); or ii) two variants in a gene (2x AB). For example, from:

Variant Gene    Sample1 Sample2 Sample3
   1    TP53    AA  BB  AB
   2    TP53    AB  AA  AB
   3    TP53    AB  AA  AA
   4    KRAS    AA  AB  AA
   5    KRAS    AB  AB  BB

I'm looking for:

Gene Two_variants Homozygous Either
TP53     2            1        3
KRAS     1            1        2 

Any help would be much appreciated. Thanks.

R_G

shellter
  • 36,525
  • 7
  • 83
  • 90
R_G
  • 1
  • 1
    How are `2 1 3` and `1 1 2` generated. Sorry not able to follow your question at all. – jaypal singh Jul 02 '13 at 21:24
  • Shouldn't the output be `1 1 2` for TP53? – gbrener Jul 02 '13 at 21:41
  • Hi Gregb. Thanks for your response. TP53 should be 2 1 3 as written. Sample 1 and 3 are AB for two variants each (AB x2), so the Two_variants column should be 2. Sample 2 is BB for variant 1, so Homozygous should be 1. Either is the sum of the two columns. – R_G Jul 23 '13 at 00:03

2 Answers2

1

In GNU awk:

awk '/\<AB\>.+\<AB\>/ { arr[$2,"AB"] += 1 }
             /\<BB\>/ { arr[$2,"BB"] += 1 }
                  END { for ( elt in arr ) {
                          split ( elt, index_parts, SUBSEP )
                          genes[index_parts[1]] = 0
                        }
                        printf "%4s%13s%11s%7s\n", "Gene", "Two_variants", "Homozygous", "Either"
                        for ( gene in genes ) {
                          printf "%4s%6d%13d%9d\n", gene, arr[gene,"AB"], arr[gene,"BB"], arr[gene,"AB"] + arr[gene,"BB"]
                        }
                      }' input.txt
gbrener
  • 5,365
  • 2
  • 18
  • 20
0
use warnings;
use strict;
my (@header, %data);
open(my $file, "<", "input") or die("$?");
while (<$file>) {
    @header = split, next if not @header;
    my @v = split;
    $data{$v[1]}->{$_}++ for (@v[2..$#v]);
}
close $file;
print "Gene Two_variants Homozygous Either\n";
for my $k (keys %data) {
    my ($var2, $homoz) = (int($data{$k}{AB}/2), $data{$k}{BB});
    my $sum = $var2 + $homoz;
    printf("%4s %8d %9d %8d\n", $k, $var2, $homoz, $sum) if $sum;
}
perreal
  • 94,503
  • 21
  • 155
  • 181