1

I working a on bash script made by a colleague. Since there is no multidimensional data structure in bash, there was lots of workarounds to process the input files. When I saw his script, I thought Perl can handle that much more easily. I did a decent job, I think, at doing the conversion, but now I'm stuck. For those who know, the input files are in Variant Calling Format (VCF). Basically a tab-delimited text file with specific bits of information in specific columns. Here's what the input file looks like:

##fileformat=VCFv4.2
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  MBWGS139
AF2122_NC002945 1012102 .   A   G   1137.77 .   AC=2;AF=1.00;AN=2   GT:AD:DP:GQ:PL  1/1:0,29:29:87:1166,87,0
AF2122_NC002945 103811  .   C   G   241.84  .   AC=2;AF=1.00;AN=2   GT:AD:DP:GQ:PL  1/1:0,6:6:18:270,18,0

First thing I do is to store all the file information in a hash. Multiple files (or samples) are all integrated in that "master" hash, by looping each files line by line. The master hash in constructed like this:

push(@{ $vcfs{$sample}{$CHROM}{$POS} }, $ID, $REF, $ALT, $QUAL, $FILTER, $INFO, $FORMAT, $SAMPLE);

Then I need to filter out the "bad" positions. I'll skip that code here, but the bits of information that I keep are stored in a new hash which is created like this:

push ( @{ $fastas{$sample}{$chrom}{$pos} }, $allele);

Let's say that $allele = $ALT (but it's a bit more complicated that that). So inside %fasta it looks like that:

Sample1
    chrom1
        pos1 = A
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G
Sample2
    chrom1
        pos1 = C
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G

See this as if I had uses Data::Dumper to print it... You can also see that for that part of the code, I ditched a bunch a info.

The last filtering step is to remove the "alleles" that are common to all the samples at a specific position. In the previous example, only pos1 of chrom would be kept. The final hash would look like this:

Sample1
    chrom1
        pos1 = A
Sample2
    chrom1
        pos1 = C

I figured out a way to do it by ditching the position information. I'd need to keep it. I'd really appreciate if someone could help me out here. I can provide my current solution if that would help.

Thanks, Marco

1 Answers1

0

I found 2 ways to get the desired output. But, seeing as there are only 4 alleles to consider, I wonder how it is possible that there may be one that is unique in a larger data set.

The solution includes a %count hash that can be used to find unique alleles.

#!/usr/bin/perl
use strict;
use warnings;

my (%fastas, %counts);

my ($sample, $chr);
while (<DATA>) {
    if (/^(\S.+)$/) {
        $sample = $1;   
    }
    elsif (/^\t(\S.+)$/) {
        $chr = $1;  
    }
    elsif (/^\t\t(pos\d+)\s=\s([TAGC])$/) {
        my $pos = $1,
        my $allele = $2;
        push @{ $fastas{$sample}{$chr}{$pos} }, $allele;
        $counts{$allele}++; 
    }
    else {
        die "Unknown input line $.: <<$_>>";    
    }
}

for my $sample (keys %fastas) {
    my $chr_ref = $fastas{$sample};

    for my $chr (keys %$chr_ref) {
        my $pos_ref = $chr_ref->{$chr};

        for my $pos (keys %$pos_ref) {
            my $allele_ref = $pos_ref->{$pos};

            for my $allele (@$allele_ref) {
                print "$sample $chr $pos $allele\n"
                    if $counts{$allele} == 1;   
            }
        }
    }   
}

__DATA__
Sample1
    chrom1
        pos1 = A
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G
Sample2
    chrom1
        pos1 = C
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G

The second solution makes use of the Deep::Hash::Utils module. It simplifies the code by eliminating the nested loops in the first solution.

#!/usr/bin/perl
use strict;
use warnings;
use Deep::Hash::Utils 'reach';

my (%fastas, %counts);

my ($sample, $chr);
while (<DATA>) {
    if (/^(\S.+)$/) {
        $sample = $1;   
    }
    elsif (/^\t(\S.+)$/) {
        $chr = $1;  
    }
    elsif (/^\t\t(pos\d+)\s=\s([TAGC])$/) {
        my $pos = $1,
        my $allele = $2;
        push @{ $fastas{$sample}{$chr}{$pos} }, $allele;
        $counts{$allele}++; 
    }
    else {
        die "Unknown input line $.: <<$_>>";    
    }
}

while (my @list = reach(\%fastas)) {
    my $allele = $list[-1];
    print "@list\n" if $counts{$allele} == 1;
}

__DATA__
Sample1
    chrom1
        pos1 = A
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G
Sample2
    chrom1
        pos1 = C
        pos2 = T
    chrom2
        pos1 = A
        pos2 = G

On the data set you provided, both solutions printed Sample2 chrom1 pos1 C.

Chris Charley
  • 6,403
  • 2
  • 24
  • 26
  • Chris, sorry if I miss lead you here. Your solution is elegant. I was too much into it that I forgot to share lots of important information when I posted my question. I'll fix my original question. Let me know if you want to give it a try. – Marc-Olivier Duceppe Oct 03 '16 at 13:33
  • OK, even though I made mistakes in my original post, your solution can still be applied. The trick was to create %count and populate in the same loop than %fastas! Why didn't I think of this!!! I was stuck on trying to do it afterward. Thank you so much Chris! Marco – Marc-Olivier Duceppe Oct 03 '16 at 16:26
  • Plus I didn't know about Deep::Hash::Utils 'reach'. I does make the code more readable! – Marc-Olivier Duceppe Oct 03 '16 at 16:27
  • @Marc-Olivier Duceppe Glad you were able to adopt a solution from my post. Yes, that module can simplify some hash problems. – Chris Charley Oct 03 '16 at 16:40