1

I have an example DNA sequence like: S = ATGCGGGCGTGCTGCTGGGCTGCT.... of length 5MB size. Also, I have the gene coordinates for each gene like:

Gene no. Start End
1          1    50
2         60    100
3         110   250
.....
4000      4640942 4641628 

My goal is to perform a certain calculation for every gene start position. My code is working perfectly. However, it is quite slow. I have gone through many help pages for making it faster using threads but unfortunately couldn't figure out.

Here's the summary of my code:

foreach my $gene($sequence){
     my @coordinates = split("\t",$gene);
     $model1 = substr($sequence, $coordinates[1], 50);
     $model2 = substr($sequence, $coordinates[1], 60);
     $c-value = calculate($model1, $model2);
     ....
}

sub calculate {
     ......
}

I would really appreciate if anyone can suggest me how to parallelize this kind of programs. What I want to parallelize is the calculation of c-value between model1 and model2 for each gene, which ultimately will fasten the process. I have tried using Threads::queue but ended with a bunch of errors. I'm fairly new to Perl programming so any help is highly appreciated.

Thank you, everyone, for your comments and suggestions. I have modified the code and it seems to be working using the Perl module Parallel::ForkManager. The code is successfully using all of the 4 cores of my computer.

Here's the modified code:

    use strict;
    use warnings;
    use Data::Dumper;
    use Parallel::ForkManager;
    my $threads = 4;
    my $pm = new Parallel::ForkManager($threads);
    my $i = 1; #gene number counter
    $pm -> run_on_finish( sub { $i++; print STDERR "Checked $i genes" if ($i % $number_of_genes == 0); } ); 
    my @store_c_value = ();
    foreach my $gene($sequence){
                 my $pid = $pm->start and next;
                 my @coordinates = split("\t",$gene);
                 my $model1 = substr($sequence, $coordinates[1], 50);
                 my $model2 = substr($sequence, $coordinates[1], 60);
                 my $c-value = calculate($model1, $model2);
                 push(@store_c_value, $c-value);
                 $i++;
                 $pm->finish;
            }
    $pm->wait_all_children;
            sub calculate {
                 ......
                 return ($c-value);
            }
    print Dumper \@store_c_value;

The current issue is I'm not getting any output for @store_c_value(i.e. empty array). I found that you can't store the data from a child process to an array, which was declared in the main program. I know I can print it to an external file, but I want this data to be in the @store_c_value array as I'm using it again later on in the program.

Thank you again for helping me out.

  • 1
    Try `AnyEvent::Fork` https://metacpan.org/pod/AnyEvent::Fork. – UjinT34 Oct 28 '18 at 08:50
  • Step one for any optimization is to measure. Then, use a profiler to identify how much time is spent where exactly. Then, and only then, think about different approcahes to optimize. – Ulrich Eckhardt Oct 28 '18 at 10:06
  • Please show more of your code, not everyone on the perl tag is familiar with bioinformatics, but may still help you with parallelization if given more details.. – Håkon Hægland Oct 28 '18 at 13:52
  • One error may be because `$c-value` is not a valid Perl identifier. If you're using strict and warnings (you should) this will cause an error due to the value bareword. Only word characters (letters, digits, underscore) are valid in Perl identifiers. – Grinnz Oct 28 '18 at 19:33
  • Also `foreach my $gene($sequence){` doesn't make sense, it will only run once because you provided one scalar value for it to iterate over. – Grinnz Oct 28 '18 at 19:53
  • Thank you for suggestions and comments. I have modified the code. Kindly see, I have edited the main post. – Soham Sengupta Oct 29 '18 at 01:20
  • No, you can't write from a child to parent directly. For complete examples of how to pass data back with `Parallel::ForkManager` see [this post](https://stackoverflow.com/a/41891334/4653379) (and a link in it, and linked documentation). – zdim Oct 29 '18 at 07:38
  • Your code still has all of the same problems I pointed out. Parallel::ForkManager and IO::Async::Function both provide mechanisms for returning the result of the child process to the parent for you. – Grinnz Oct 29 '18 at 19:16

1 Answers1

1

One option is IO::Async::Function, which will use forks or threads depending what OS you're on (forking is much more efficient on Unixy systems), and maintain a set of worker processes/threads to run the code in parallel. It returns Future instances which can be used to synchronize the async code as you need. There are lots of ways to use Future, a couple are presented below.

use strict;
use warnings;
use IO::Async::Loop;
use IO::Async::Function;
use Future;

my $loop = IO::Async::Loop->new;
# additional options can be passed to the IO::Async::Function constructor to control how the workers are managed
my $function = IO::Async::Function->new(code => \&calculate);
$loop->add($function);

my @futures;
foreach my $gene($sequence){
     my @coordinates = split("\t",$gene);
     my $model1 = substr($sequence, $coordinates[1], 50);
     my $model2 = substr($sequence, $coordinates[1], 60);
     push @futures, $function->call(args => [$model1, $model2])->on_done(sub {
         my $c_value = shift;
         # further code using $c_value must be here, to be run once the calculation is done
     })->on_fail(sub {
         warn "Error in calculation for $gene: $_[0]\n";
     });
}

# wait for all calculations and on_done handlers before continuing
Future->wait_all(@futures)->await;

If you want the program to stop immediately if there's an exception in one of the calculations, you can use needs_all and remove the individual on_fail handlers, and use get which is a wrapper of await that will then return all of the c-values in order if they succeed, or throw an exception if one fails.

my @c_values = Future->needs_all(@futures)->get;
Grinnz
  • 9,093
  • 11
  • 18