
I have a legacy project which takes a huge amount of data from STDIN and processes it line by line in a Perl script. The line order is not important. This is taking very long, so I want to run it in parallel.

After a bit of research I found Parallel::Loops, which seems suitable, but I can't get it working because $_ is empty. My code is:

#Initialize all vars etc

$pl->while ( sub { <STDIN> }, sub {
    print $_;      # but $_ is empty
} );

Other ways of reading from STDIN in parallel are welcome too.


Update:

After all the help that I received I managed to put together some working code, thank you. Here is a brief summary. To clarify:

  1. This is a kind of parser; it has more than 3000 lines of regexes and conditions, which were auto-generated.

  2. The input that I use for testing is a POS-tagged text; there are 1,071,406 lines in this file.

  3. My hardware is: an SSD disk, a recent mid-range i5, and 8 GB of DDR4 RAM.

Conclusions:

  1. As the comments suggested, IO operations make my script slow.
  2. All the suggestions resulted in improvements, especially the ones that process batches of lines instead of line by line.
  3. The answers contain very useful threading implementations for future work.
  4. The Parallel::ForkManager framework introduces a lot of lag in the execution time. I always killed the script after 5 minutes, since the script without parallelism takes about 6.
  5. The Parallel::Loops framework introduces a small improvement. The script takes about 3 minutes to finish.
  6. Using GNU parallel is the easy way to optimize.
  7. Using the threads package I got the best time, 1 min 45 secs, but it is very close to GNU parallel, so it's up to you whether porting the code is worth the effort.
  8. Using the threads package as in @ikegami's answer, reading batches of lines, the times were the same as with @Tanktalus's solution, which reads line by line.

Finally, I'm going with the @ikegami solution, which I think will hold up better as the amount of data increases. I adjusted the number of lines to process per work unit to 100,000 because it gives better results than, for instance, 10,000. The difference is a matter of about 8 seconds.

The next natural step is writing everything to files instead of using STDOUT; I hope this helps to reduce the time a little bit more.
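A rough, untested sketch of the direction I have in mind, based on @ikegami's reusable worker threads, with each worker writing to its own file as @zdim suggested (the file names and the number of workers are just placeholders):

use strict;
use warnings;
use threads            qw( async );
use Thread::Queue 3.01 qw( );

use constant NUM_WORKERS    => 4;        # placeholder; tune to the number of cores
use constant WORK_UNIT_SIZE => 100_000;  # the batch size that tested best

sub worker {
    my ($out_fh, $job) = @_;
    for my $line (@$job) {
        # ... the auto-generated regexes and conditions go here ...
        print {$out_fh} $line;
    }
}

my $q = Thread::Queue->new();
$q->limit(NUM_WORKERS * 4);

my @workers = map {
    my $id = $_;
    async {
        open(my $out_fh, '>', "part_$id.out")     # one output file per worker
            or die "Cannot open part_$id.out: $!";
        while (defined( my $job = $q->dequeue() )) {
            worker($out_fh, $job);
        }
        close($out_fh);
    };
} 1 .. NUM_WORKERS;

my $done = 0;
while (!$done) {
    my @lines;
    while (@lines < WORK_UNIT_SIZE) {
        my $line = <STDIN>;
        if (!defined($line)) { $done = 1; last; }
        push @lines, $line;
    }
    $q->enqueue(\@lines) if @lines;
}

$q->end();
$_->join for @workers;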

Iván Rodríguez Torres
    Leave your code unchanged and wrap it in **GNU Parallel** maybe... `cat hugeData | parallel --pipe ./existingScript.pl`. Or *shebang-wrap* your existing script... https://www.gnu.org/software/parallel/parallel_tutorial.html#Shebang – Mark Setchell Jan 29 '17 at 21:40
  • `cat hugeData |` is better replaced with … – ikegami Jan 30 '17 at 01:54
  • @MarkSetchell thanks. But time is a little bit worse than running the script as it is. – Iván Rodríguez Torres Jan 30 '17 at 09:59
  • Ok, hopefully it was worth a try. I guess the time is dominated by I/O rather than calculation, so parallelising the processing is not going to help much. – Mark Setchell Jan 30 '17 at 10:02
  • @MarkSetchell yep, it seems that you're right. Later I'm going to try the script with the full text, which takes 1 week to finish. Maybe there we can see the difference. – Iván Rodríguez Torres Jan 30 '17 at 10:57

3 Answers


$_ is never set because you never assign to $_!

Don't forget that

while (<STDIN>) { ... }

is short for

while (defined( $_ = <STDIN> )) { ... }

That means you were looking to use the following:

$pl->while ( sub { defined( $_ = <STDIN> ) }, sub {
    print $_;
} );

That said, clobbering $_ is a bad idea. It could very well have been aliased to some other variable by a for (...) in the caller.
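As a quick illustration of that aliasing hazard (a minimal sketch with made-up names, not part of the original answer), assigning to $_ inside a sub silently rewrites the caller's array elements:

use strict;
use warnings;

# Hypothetical helper that clobbers $_, e.g. the way $_ = <STDIN> would.
sub clobber { $_ = "CLOBBERED\n"; }

my @words = ("one\n", "two\n");
for (@words) {       # $_ is aliased to each element of @words
    clobber();       # ...so the assignment writes through the alias
}
print @words;        # prints "CLOBBERED" twice -- @words has been modified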

That means you should be using the following:

my $line;
$pl->while ( sub { defined( $line = <STDIN> ) }, sub {
    print $line;
} );

You may find that breaking the work down into coarser units than lines yields better performance, as it reduces the overhead-to-work ratio.

use constant WORK_UNIT_SIZE => 100;

my $done = 0;
my @lines;
$pl->while ( sub {
    @lines = ();
    return 0 if $done;

    while (@lines < WORK_UNIT_SIZE) {
        my $line = <>;
        if (!defined($line)) {
            $done = 1;
            return 0+@lines;
        }

        push @lines, $line;
    }

    return 1;
}, sub {
    for (@lines) {
        print $_;
    }
} );

Finally, rather than creating a new task for each work unit, you should reuse them! The following demonstrates this using threads.

use threads            qw( async );
use Thread::Queue 3.01 qw( );

use constant NUM_WORKERS    => 8;
use constant WORK_UNIT_SIZE => 100;

sub worker {
    my ($job) = @_;
    for (@$job) {
        print $_;
    }
}

my $q = Thread::Queue->new();
$q->limit(NUM_WORKERS * 4);

async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
    for 1..NUM_WORKERS;

my $done = 0;    
while (!$done) {
    my @lines;
    while (@lines < WORK_UNIT_SIZE) {
        my $line = <>;
        if (!defined($line)) {
            $done = 1;
            last;
        }

        push @lines, $line;
    }

    $q->enqueue(\@lines) if @lines;
}

$q->end();
$_->join for threads->list;
ikegami
  • Thank you, this answers the question perfectly. The problem is that it is slower than the script without parallelism. I made some tests with a reduced input (about 40 MB of plain text) and it takes more than 2 min to finish, while the original script takes 12 secs. – Iván Rodríguez Torres Jan 30 '17 at 10:01
  • If the amount of work per line is small, of course it is. You need larger work units – ikegami Jan 30 '17 at 14:32
  • yep, you're right. The amount of work at this moment is small, just a couple of regex matches – Iván Rodríguez Torres Jan 30 '17 at 14:50
  • So read more lines at once. Have each worker process 10, 100, 1000 lines at a time. – ikegami Jan 30 '17 at 15:39
  • Added to my answer. – ikegami Jan 30 '17 at 20:27
  • First of all, thank you for spending your time elaborating this. Second, it seems that the threads need a bigger work load. I was playing with `num_lines_to_collect` and I get the best results with the value 100000, which is one second slower than the original script. I'm going to keep working on this and add work load to the job. I will post results. – Iván Rodríguez Torres Jan 31 '17 at 12:02
  • a bit late, but I was finally able to update the question to reflect my experience. I'm letting you know just in case you were curious about the topic – Iván Rodríguez Torres Feb 07 '17 at 00:28
  • Hi, in the code you use `async {...}`. What is that called? I was looking for it on Google but I can't seem to find it. I tried things like "async block perl", and although I get related results, I haven't found complete info about how it works. Is it the same as `Async->new(..)`? Thank you – Iván Rodríguez Torres Mar 09 '17 at 00:50
  • @Iván Rodríguez Torres, It's from the [threads](http://search.cpan.org/perldoc?threads) module. (`use threads qw( async );` should have made that clear!) `my $thread = async { ... };` is short for `my $thread = threads->create(sub { ... });` (see the short sketch after these comments). – ikegami Mar 09 '17 at 07:48
  • yep, you are right. Thank you for another doubt solved! – Iván Rodríguez Torres Mar 09 '17 at 09:09
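To make that equivalence concrete, here is a minimal sketch (not from the answer itself; `do_work` is a hypothetical sub):

use strict;
use warnings;
use threads qw( async );

sub do_work { print "working in thread ", threads->tid(), "\n"; }

# These two lines create equivalent threads:
my $t1 = async { do_work() };
my $t2 = threads->create( sub { do_work() } );

$_->join for ($t1, $t2);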

I don't know about specific benefits of using Parallel::Loops (there may well be some). Here is the same done with Parallel::ForkManager, which is what Parallel::Loops uses underneath.

use warnings;
use strict;
use feature 'say';

use Parallel::ForkManager;   

my $max_procs = 30; 
my $pm = Parallel::ForkManager->new($max_procs);   

# Retrieve data returned by children in the callback
my %ret_data;      
$pm->run_on_finish( sub { 
    my ($pid, $exit, $ident, $signal, $core, $dataref) = @_; 
    $ret_data{$pid} = $dataref;
});

while (my $input = <STDIN>)
{
    chomp($input);

    $pm->start and next;
    my $ret = run_job($input);
    $pm->finish(0, \$ret);
}
$pm->wait_all_children;

foreach my $pid (keys %ret_data) {
    say "$pid returned: ${$ret_data{$pid}}";
}

sub run_job { 
    my ($input) = @_; 
    # your processing
    return $input;    # to have something to check
} 

This code returns a scalar from a child process, a single value. You can return any data structure, see Retrieving data structures from child processes in docs and this post for an example.
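For instance, here is a minimal sketch (not from the original answer; the field names are made up) of returning a hash reference from each child:

use warnings;
use strict;

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);

my %ret_data;
$pm->run_on_finish( sub {
    # the sixth argument is whatever reference the child passed to finish()
    my ($pid, $exit, $ident, $signal, $core, $dataref) = @_;
    $ret_data{$pid} = $dataref if defined $dataref;
});

for my $chunk (1 .. 4) {
    $pm->start and next;
    my %summary = ( chunk => $chunk, lines_seen => 100 * $chunk );  # placeholder "work"
    $pm->finish(0, \%summary);    # any reference (hash, array, object) can be returned
}
$pm->wait_all_children;

for my $pid (keys %ret_data) {
    print "$pid: chunk $ret_data{$pid}{chunk}, $ret_data{$pid}{lines_seen} lines\n";
}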

Data is returned via files and that can slow things down for large data or many quick processes.

If you are testing at a terminal, stop input with Ctrl-D (or add last if $input !~ /\S/; after the chomp to stop on an empty line -- but not when data is being passed to STDIN by other means).


It has been clarified that each STDIN read is just one line to process. In that case we should collect more lines before spawning a new process, otherwise there is far too much overhead.

my $num_lines_to_collect = 1000;

my @lines_to_process;         # collect lines for each fork

while (my $input = <STDIN>)
{
    chomp($input);
    push @lines_to_process, $input;
    next if $. % $num_lines_to_collect != 0;

    if ($pm->start) {             # parent: reset the batch and keep reading
        @lines_to_process = ();
        next;
    }
    my $ret = run_job( \@lines_to_process );
    $pm->finish(0, \$ret);
}

# handle the lines left over after the last full batch
if (@lines_to_process and not $pm->start) {
    my $ret = run_job( \@lines_to_process );
    $pm->finish(0, \$ret);
}
$pm->wait_all_children;

We add lines to the array @lines_to_process and only trigger a new fork when the current line number $. is a multiple of $num_lines_to_collect, so a job is started for every $num_lines_to_collect lines and each job processes that many. The parent then resets the batch and keeps reading, and any lines remaining after the last full batch are processed after the loop. I set it to 1000; experiment.

zdim
  • I don't need the return value since I print to STDOUT, so I removed that. The solution works, so I upvote. The problem with this is the same as with the accepted answer: it is slower than the script without parallelism. I made some tests with a reduced input (about 40 MB of plain text) and it takes more than 2 min to finish, while the original script takes 12 secs. But I think that this should be a new question. Thank you – Iván Rodríguez Torres Jan 30 '17 at 10:55
  • I thought that the "input" was more than just one line to process, that it triggers a job. If it is one little thing to do, then spawning a process each time is a huge overhead. So read a bunch of lines, then fire a process. I added code for that. – zdim Jan 30 '17 at 17:42
  • First of all, thank you for spending your time elaborating this. Second, it seems that the threads need a bigger work load. I was playing with `num_lines_to_collect` and I get the best results with the value 100000, which is one second slower than the original script. I'm going to keep working on this and add work load to the job. I will post results. – Iván Rodríguez Torres Jan 31 '17 at 12:02
  • @IvánRodríguezTorres Hm. It may be that there is just too little to do with each line so that reading from the disk makes up most of processing time. Still, I'd expect that having multiple jobs would bring _some_ gain. Try from the other end -- collect so many lines that there are just two jobs first, then three, etc. // How big are your files (how many lines), and what (roughly) is done to each line? You mentioned "few regex" ... is that it? // Are you on some very old/poor hardware? – zdim Jan 31 '17 at 18:56
  • @IvánRodríguezTorres Also, you say "_I print to STDOUT_" ... do you print _every line_ to `STDOUT` as it's processed? If yes, change that. Have each job collect results and then print them out, and to a file instead. (Each job to its own file, which can then be merged.) Your whole thing appears I/O bound as it is, and adding a lot of (_slow_) prints to `STDOUT` doesn't help. – zdim Jan 31 '17 at 19:02
  • @IvánRodríguezTorres Let's clear up what we can here, and then this may well be a good question in its own right. (Unless we do resolve it here.) – zdim Jan 31 '17 at 19:05
  • thanks again for your time. I won't be able to do more tests until tomorrow, but then I'm going to apply all your suggestions and I will keep you posted. // The size of the files varies, but there are several with more than 50 GB of plain text. I was making the tests with a reduced collection of 40 MB. The job essentially is a kind of syntactic analyzer, so each line goes through a series of if-else statements and some regular expressions. // The hardware is fine for this: an SSD disk, a recent mid-range i5, and 8 GB of RAM. – Iván Rodríguez Torres Jan 31 '17 at 19:18
  • I print to STDOUT because that is this script's output, but I think I can change the script to write to files. This script is the first part of a big process that is launched similar to this: `cat mybigtext | thisScript.perl | anotherPart.perl |...`; the output of one script is the input for the next. So anotherPart.perl takes as STDIN what thisScript.perl generates, and so on. If I follow your advice I think I can change this behavior and replace STDIN and STDOUT with files. The problem with that will be the amount of GB spent on these temp files. Anyway, it's worth a try – Iván Rodríguez Torres Jan 31 '17 at 19:26
  • @IvánRodríguezTorres Ah, I see. All that output going to a single-file road (input for the next script) doesn't help. If multiple jobs print their own file each there is at least some time savings, since it does happen that while some are writing some are processing. Some concurrency. This is I/O heavy and to improve it by multitasking you'll have to wrestle it out, I think. I'd try: Set up a moderate number of jobs, say 4--10, and have each write to its own file. If that helps, you can replace the pipeline with a simple script, `s1.pl < input; s2.pl < s1_output; ...`, like you say. – zdim Jan 31 '17 at 19:35
  • sure. I will give it a try tomorrow, I hope. I'll keep you informed. Thank you – Iván Rodríguez Torres Jan 31 '17 at 19:40
  • a bit late, but I was finally able to update the question to reflect my experience. I'm letting you know just in case you were curious about the topic. – Iván Rodríguez Torres Feb 07 '17 at 00:25
  • @IvánRodríguezTorres Thank you very much for letting me know! Yes, I was very curious. Good that you got it to speed up :). One more little suggestion -- just for the sake of knowing, if it's easy to try it would be good to try with Parallel::ForkManager (not Parallel::Loops). I don't know of any particular faults of the other one, but it does use the P::FM and it builds on top of it. Which is why I went with the source, P::FM. While I don't know why the other one would introduce such a slowdown I really can't see why P::FM would. – zdim Feb 07 '17 at 00:40
  • @IvánRodríguezTorres That's a very nice update, but one suggestion while you are at it -- add what your final solution was. How many lines to process at once? Print to individual files from each job or not? – zdim Feb 07 '17 at 00:44
  • Good advice, thanks again. I'm going to update later. Maybe tomorrow, but I'll come back – Iván Rodríguez Torres Feb 07 '17 at 00:46
  • I think that with the content of these questions we could create a parallelism topic for Perl in the Documentation section; would that be useful? – Iván Rodríguez Torres Feb 07 '17 at 00:55
  • @IvánRodríguezTorres I am not familiar with how docs work, but it seems like a good idea. Btw, while threads have (some) advantages over forked processes, my reaction to forks being by a factor of 5 slower is that something is wrong with how forking is used. It should not be slower by much, if at all. (While the overhead matters here, the threads as implemented by Perl are not the light-weight threads as one may expect so I don't see why they would be better. The solution by ikegami reuses them, what directly helps, but I don't see how that would gain a factor of 4 or 5.) – zdim Feb 07 '17 at 01:09
  • I made a mistake in the previous update. I was speaking of ForkManager instead of Loops. Now it's correct. And yes, it is possible that I made some mistake using ForkManager, but I used your code as a starting point and I double-checked everything that I added, and it seems OK. That said, maybe there is some error that I don't spot. I'm with you that it's something weird. – Iván Rodríguez Torres Feb 10 '17 at 11:49
  • @IvánRodríguezTorres Nice summary, and thanks for the response. P::FM shouldn't be slower (by much, anyway); perhaps it needs fewer processes and more to do in each. But it doesn't matter, you got a good solution :) By your description, I think that writing to files should speed it up considerably. Good work :) – zdim Feb 10 '17 at 17:29
  • Thank you, maybe I will make it work with files, since I'm going to keep working with these scripts. If so, I'll update this again. – Iván Rodríguez Torres Feb 10 '17 at 20:53

Probably the simplest manner here is to create a pool of threads, each one listening on the same queue, and then have one thread (probably the main thread) read the file and push each line on to the queue.

use strict;
use warnings;
use Thread qw(async);
use Thread::Queue;

my $q = Thread::Queue->new();
$q->limit(32); # no point in reading in more than this into memory.

my @thr = map {
    async {
        while (defined (my $line = $q->dequeue()) ) {
            print $line;
        }
    };
} 1..4; # 4 worker threads

while (<STDIN>)
{
    $q->enqueue($_);
}
$q->end();

$_->join for Thread->list;

Just as a point of warning, beware if you need to push data from the worker threads back to the main thread. It's not as trivial as in other languages.

Update: switched from threads to Thread. While the async function is documented as returning thread objects, that didn't seem to work for me, so I had to change the join as well.
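Regarding the warning above about getting data back to the main thread, here is a minimal sketch (not part of the original answer) that uses a second Thread::Queue as a results channel; counting line lengths stands in for real processing:

use strict;
use warnings;
use threads            qw( async );
use Thread::Queue 3.01 qw( );

my $work    = Thread::Queue->new();
my $results = Thread::Queue->new();
$work->limit(32);

my @workers = map {
    async {
        while (defined( my $line = $work->dequeue() )) {
            $results->enqueue( length($line) );   # placeholder "processing"
        }
    };
} 1 .. 4;

while (my $line = <STDIN>) {
    $work->enqueue($line);
}
$work->end();
$_->join for @workers;
$results->end();

my $total = 0;
while (defined( my $len = $results->dequeue_nb() )) {
    $total += $len;
}
print "total bytes read: $total\n";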

Tanktalus
  • I haven't accepted the answer yet because after applying your solution the script throws errors that had never appeared before. If I manage to solve them I will accept, since your queue system looks pretty useful. – Iván Rodríguez Torres Jan 29 '17 at 21:08
  • I don't feel I have enough recent experience in Perl to downvote, but the `threads` module is *officially* discouraged. They are not as lightweight as you would expect. (I do recall trying to use them in an I/O-bound project about 7 years ago, and they made the code about 5x *slower*.) – chepner Jan 29 '17 at 22:37
  • @chepner Given that the use is "discouraged", as we all know, I keep thinking about uses which are still reasonable. (Threads do offer specific benefits, after all.) So, I am curious -- in your project, they were "_5x slower_" ... in comparison to what? Sequential processing, or forking, or ...? Btw, I'd like to know about reasons for downvote, too. – zdim Jan 29 '17 at 23:13
  • @chepner, Re *they made the code about 5x slower*", You did something seriously wrong, then. That's why the warning is there. Multi-tasking is hard. – ikegami Jan 30 '17 at 02:01
  • @chepner, Re "*the threads module is officially discouraged*", They are discouraged for the reason stated in the previous paragraph: They are heavy and multi-tasking is complicated. The same applies to Parallel::Loops. It would be just as discouraged by that same author. – ikegami Jan 30 '17 at 02:04
  • Thank you for this answer, but this solution is two time slower than the original script. – Iván Rodríguez Torres Jan 30 '17 at 10:03
  • And you said the other solution is about 10x slower than the original script. The difference between 10x slower and 2x slower is actually very small. The issue, again, is that you do far too little per work unit. Read more lines at once! – ikegami Jan 30 '17 at 15:42
  • @chepner and Tanktalus, a bit late, but I was finally able to update the question to reflect my experience. I'm letting you know just in case you were curious about the topic – Iván Rodríguez Torres Feb 07 '17 at 00:40