I have an example DNA sequence like:
S = ATGCGGGCGTGCTGCTGGGCTGCT....
about 5 MB in size.
Also, I have the gene coordinates for each gene like:
Gene no.   Start     End
1          1         50
2          60        100
3          110       250
.....
4000       4640942   4641628
My goal is to perform a certain calculation at every gene start position. My code works correctly, but it is quite slow. I have gone through many help pages on making it faster with threads, but unfortunately I couldn't figure it out.
Here's the summary of my code:
foreach my $gene (@genes) {    # @genes: the tab-separated coordinate lines
    my @coordinates = split("\t", $gene);
    my $model1  = substr($sequence, $coordinates[1], 50);
    my $model2  = substr($sequence, $coordinates[1], 60);
    my $c_value = calculate($model1, $model2);    # "$c-value" is not a valid Perl variable name
    ....
}
sub calculate {
    ......
}
I would really appreciate it if anyone could suggest how to parallelize this kind of program. What I want to parallelize is the calculation of the c-value between model1 and model2 for each gene, which should ultimately speed up the process. I have tried using Thread::Queue but ended up with a bunch of errors. I'm fairly new to Perl programming, so any help is highly appreciated.
Thank you, everyone, for your comments and suggestions. I have modified the code to use the Perl module Parallel::ForkManager, and it seems to be working. The code now uses all 4 cores of my computer.
Here's the modified code:
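use strict;
use warnings;
use Data::Dumper;
use Parallel::ForkManager;

# $sequence, @genes (the coordinate lines) and $number_of_genes are set up earlier (not shown)
my $threads = 4;
my $pm = Parallel::ForkManager->new($threads);
my $i = 0;    # counter of finished genes
# This callback runs in the parent each time a child exits
$pm->run_on_finish( sub { $i++; print STDERR "Checked $i genes\n" if ($i % $number_of_genes == 0); } );
my @store_c_value = ();

foreach my $gene (@genes) {
    my $pid = $pm->start and next;    # parent continues with the next gene
    # Everything from here to finish() runs in the child process
    my @coordinates = split("\t", $gene);
    # substr offsets are 0-based; subtract 1 if Start is 1-based
    my $model1  = substr($sequence, $coordinates[1], 50);
    my $model2  = substr($sequence, $coordinates[1], 60);
    my $c_value = calculate($model1, $model2);
    push(@store_c_value, $c_value);    # changes only the child's copy of the array
    $pm->finish;
}
$pm->wait_all_children;

sub calculate {
    ......
    return ($c_value);
}

print Dumper \@store_c_value;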
The current issue is that I get no output for @store_c_value (i.e. the array is empty). I found that a child process can't store data in an array declared in the main program, because each forked child works on its own copy of that array.
I know I can print the results to an external file, but I want this data in the @store_c_value array because I use it again later in the program.
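Going by the Parallel::ForkManager documentation, one approach that may work is to have each child hand its result to finish() as a reference and collect it in the parent inside run_on_finish. Below is a minimal sketch of that idea, not a tested solution for my real data: the short $sequence, the three @genes lines and the body of calculate() are made-up placeholders, and using the gene index as the process identifier is just one way to keep the results in order.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical stand-ins for the real data: a short sequence and
# tab-separated "gene number, start, end" lines.
my $sequence = 'ATGCGGGCGTGCTGCTGGGCTGCT' x 10;
my @genes    = ("1\t1\t50", "2\t60\t100", "3\t110\t250");

# Placeholder for the real calculation, which is not shown in this post.
sub calculate {
    my ($model1, $model2) = @_;
    return length($model2) - length($model1);
}

my $pm = Parallel::ForkManager->new(4);
my @store_c_value;

# Runs in the parent each time a child exits. The third argument is the
# identifier given to start(), the sixth is the reference the child
# passed to finish().
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $store_c_value[$ident] = $$data if defined $data;
});

for my $index (0 .. $#genes) {
    $pm->start($index) and next;    # parent: go on to the next gene
    # From here on we are in the child process.
    my @coordinates = split("\t", $genes[$index]);
    my $model1  = substr($sequence, $coordinates[1], 50);
    my $model2  = substr($sequence, $coordinates[1], 60);
    my $c_value = calculate($model1, $model2);
    $pm->finish(0, \$c_value);      # child: send the result back and exit
}
$pm->wait_all_children;

print "$_\n" for @store_c_value;
```

Since this forks one process per gene, it might be worth handing each child a batch of genes and returning an array reference instead, so that the fork overhead does not outweigh the calculation for about 4000 genes.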
Thank you again for helping me out.