I am trying to analyze several thousand sequences for double restriction enzyme cuts. For this example, let's say these two enzymes are EcoRI and MspI. I am using Bioperl modules for this (see the code below) and can make it work for single enzymes. It's just that the two enzymes working altogether are giving either weird results, or not working as expected.
#!/usr/bin/perl
use strict;
use Bio::Seq;
use Bio::SeqIO;
use Bio::Restriction::Analysis;
use Bio::Restriction::EnzymeCollection;
use List::Util qw(min max sum);
my $usage = "perl $0 <in><out>";
my $in = shift or die $usage;
my $out = shift or die $usage;
my ($ra, $seq);
my $enz_collection = Bio::Restriction::EnzymeCollection->new();
my (@ecorbp, @ecorsize, @mspbp, @mspsize);
my $seq_in = Bio::SeqIO->new(
-format => 'fasta',
-file => $in,
);
open OUT, ">$out";
while ($seq = $seq_in -> next_seq)
{
$ra = Bio::Restriction::Analysis->new(-seq=>$seq);
#how many times does each enzyme cut
my $EcoRIcuts = $ra->cuts_by_enzyme('EcoRI');
my $MspIcuts = $ra->cuts_by_enzyme('MspI');
my $bothcuts = $ra->cuts_by_enzyme('EcoRI', 'MspI');
#print $bothcuts."\n";
#doing something with bp
@ecorbp = $ra->positions('EcoRI');
@mspbp = $ra->positions('MspI');
my $ecorbp_min = min(@ecorbp);
my $ecorbp_max = max(@ecorbp);
my $mspbp_min = min(@mspbp);
my $mspbp_max = max(@mspbp);
my $ecorbp_avg = sum(@ecorbp)/@ecorbp;
my $mspbp_avg = sum(@mspbp)/@mspbp;
#doing something with cut size/fragment
@ecorsize = $ra->sizes('EcoRI');
@mspsize = $ra->sizes('MspI');
my $ecorsize_min = min(@ecorsize);
my $ecorsize_max = max(@ecorsize);
my $mspsize_min = min(@mspsize);
my $mspsize_max = max(@mspsize);
my $ecorsize_avg = sum(@ecorsize)/@ecorsize;
my $mspsize_avg = sum(@mspsize)/@mspsize;
my @bothsize = $ra->sizes('EcoRI','MspI');
my $bothsize_max = max (@bothsize);
my $bothsize_min = min (@bothsize);
print $bothsize_max,"\n";
print $bothsize_min,"\n";
#do similar thing with the cut positions; this part not written yet
@bothbp = $ra->positions('EcoRI','MspI');
#print OUT "EcoRI cuts ",$seq -> display_id," $EcoRIcuts times with at minimum base position of $ecorbp_min, maximum base position of $ecorbp_max with average base position of $ecorbp_avg to give minimum frament size of $ecorsize_min bp, maximum fragment size of $ecorsize_max bp with average fragment size of $ecorsize_avg bp\n";
#print OUT "MspI cuts ",$seq -> display_id," $MspIcuts times with at minimum base position of $mspbp_min, maximum base position of $mspbp_max with average base position of $mspbp_avg to give minimum frament size of $mspsize_min bp, maximum fragment size of $mspsize_max bp with average fragment size of $mspsize_avg bp\n";
}
close OUT;
The problem comes when the code reaches the 'bothsize' part... printing numbers in decimals for some weird reason.
Here's my very short input file that I am using for testing the code:
>Seq1
AAAAGAATTCAAAAAAAAGAAAACAATAAGTAAAAGACAGACCGGCAGAGAAAACTTACCCGAC
>Seq2
AAACGAGAATTCAAAAAATAGAAAACCGGAATAAGTAAAAGACAGGAATTCAGCAGCCGGAGAA
The code is expected to report the position where both enzymes cut (similar to ecorbp
, mspbp
) & also the cut fragment size (similar to ecorsize
and mspsize
).
I would appreciate any help or pointers on how I can code this right. Thanks.