I have a legacy project that takes a huge amount of data from STDIN and processes it line by line in a Perl script. The line order is not important. This takes a very long time, so I want to run it in parallel.
After a bit of research I found Parallel::Loops, which seems suitable, but I can't get it working because $_ is empty. My code is:
# Initialize all vars etc.
$pl->while( sub { <STDIN> }, sub {
    print $_;   # but $_ is empty
} );
Other ways of reading from STDIN in parallel are welcome too.
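For reference, here is a minimal sketch of why `$_` can end up empty in a call shaped like the one above. Perl rewrites `while (<STDIN>)` into `while (defined($_ = <STDIN>))`, but a bare `<STDIN>` inside a sub only returns the line and never assigns it. The `run_while` helper is hypothetical: it stands in for any loop helper that calls a condition sub and then a body sub, on the assumption that the helper does not set `$_` itself. An in-memory filehandle replaces STDIN so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a loop helper: it calls the condition sub,
# then the body sub, and never assigns to $_ on its own.
sub run_while {
    my ($cond, $body) = @_;
    my @seen;
    while ($cond->()) {
        push @seen, $body->();
    }
    return @seen;
}

# Run the loop over a two-line in-memory "file" with the given condition.
sub demo {
    my ($make_cond) = @_;
    my $data = "a\nb\n";
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    local $_;   # start with an undefined $_
    return run_while($make_cond->($fh), sub { defined $_ ? $_ : '<undef>' });
}

# The condition only returns the line: $_ is never assigned.
my @broken = demo(sub { my $fh = shift; sub { <$fh> } });

# The condition assigns the line to $_ explicitly, as while (<FH>) would.
my @fixed = demo(sub { my $fh = shift; sub { defined($_ = <$fh>) } });
```

With the first condition the body sees an undefined `$_` on every iteration. With the second it sees each line in turn.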
Update:
With all the help I received I managed to get a working piece of code, thank you. Here is a brief summary. To clarify:
This is a kind of parser; it has more than 3000 lines of auto-generated regexes and conditions.
The input I use for testing is a POS-tagged text; the file has 1071406 lines.
My hardware: an SSD, a mid-range latest-generation i5, and 8 GB of DDR4 RAM.
Conclusions:
- As the comments suggested, I/O operations make my script slow.
- All the suggestions resulted in improvements, especially the ones that process batches of lines instead of going line by line.
- The answers contain very useful threading implementations for future work.
- Parallel::ForkManager introduces a lot of lag in the execution time. I always killed the script after 5 min, since the script without parallelism takes about 6.
- Parallel::Loops gives a small improvement. The script takes about 3 min to finish.
- Using GNU parallel is the easy way to optimize.
- Using the threads module I got the best time, 1 min 45 s, but that is very close to GNU parallel, so it is up to you whether it is worth the effort of porting the code.
- Using the threads module as in @ikegami's answer, reading batches of lines, the times were the same as with @tanktalus's solution, which reads line by line.
Finally, I'm going with @ikegami's solution, which I think will hold up better as the amount of data increases. I set the number of lines per batch to 100,000 because it gives better results than 10,000, for instance. The difference is about 8 seconds.
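To make the batching idea concrete, here is a minimal sketch assuming a Perl built with thread support and a Thread::Queue recent enough to provide `end`. It is not the code from @ikegami's answer. `process_line` is a hypothetical placeholder for the real parser, and `$NUM_WORKERS` is an assumed value to tune. Each worker collects its output and the main thread prints it at the end, which is only acceptable because line order does not matter.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $BATCH_SIZE  = 100_000;   # lines per batch, the value mentioned above
my $NUM_WORKERS = 4;         # assumption: tune to the number of cores

# Hypothetical placeholder for the real per-line parsing.
sub process_line {
    my ($line) = @_;
    return uc $line;
}

# Read up to $size lines from $fh; returns an array ref, or undef at EOF.
sub read_batch {
    my ($fh, $size) = @_;
    my @lines;
    while (@lines < $size and defined(my $line = <$fh>)) {
        push @lines, $line;
    }
    return @lines ? \@lines : undef;
}

sub run {
    my ($in, $out) = @_;
    my $queue = Thread::Queue->new;

    # Each worker takes batches off the queue until the queue is ended.
    my @workers = map {
        threads->create(sub {
            my $result = '';
            while (defined(my $batch = $queue->dequeue)) {
                $result .= join '', map { process_line($_) } @$batch;
            }
            return $result;
        });
    } 1 .. $NUM_WORKERS;

    # The main thread does all the reading and hands out whole batches.
    while (my $batch = read_batch($in, $BATCH_SIZE)) {
        $queue->enqueue($batch);
    }
    $queue->end;   # after this, dequeue returns undef once the queue is empty

    for my $worker (@workers) {
        my ($text) = $worker->join;
        print {$out} $text;
    }
    return;
}

run(\*STDIN, \*STDOUT) unless caller;
```

Holding each worker's output in memory until the end is a simplification. For very large inputs, a second queue for results or one output file per worker would keep memory bounded.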
The next natural step is writing everything to files instead of STDOUT; I hope this helps reduce the time a little more.