
I have the following code:

foreach my $inst (sort keys %{ ... }) {
    next if (...);

       somefuntion($a, $b, $c, $inst);
}

I would like to run this function for all the $inst values asynchronously. I tried to make it multi-threaded, but I'm having trouble with the syntax and implementation.

*** EDIT: ***

Apparently (I hadn't noticed until now), the function uses a hash, and the updates get lost. Should threads::shared help in this case, or should I just try forks?

urie

2 Answers


Perl has three major approaches I'd suggest for parallel code:

  • Threads
  • Forks
  • Nonblocking IO

The latter isn't, strictly speaking, 'parallel' in all circumstances, but it does let you do multiple things at the same time without waiting for each to finish, so it's beneficial in the right situations.

E.g. maybe you want to open 10 concurrent ssh sessions - you can use IO::Select to find which of them are 'ready' and process them as they come in.

The ssh shells themselves are, of course, separate processes.

But when doing things in parallel, you need to be aware of a couple of pitfalls. One is 'self denial of service' - you can generate huge resource consumption very easily. The other is that you've got some inherent race conditions, and your program flow is no longer deterministic - that brings you a whole new class of exciting bugs.

Threads

I wouldn't advocate spawning a thread per instance, as that scales badly. Threads in perl are NOT lightweight, like you might be assuming. That means that implementing them as if they are gives you a denial of service condition.

What I'd typically suggest is running with Thread::Queue and some "worker" threads - use the queue to pass data to a number of workers scaled to your resource availability, depending on what the limiting factor is that's making you go parallel (e.g. disk, network, CPU, etc.).

So to use a simplistic example that I've posted previously:

#!/usr/bin/perl

use strict;
use warnings;

use threads;

use Thread::Queue;

my $nthreads = 5;

my $process_q = Thread::Queue->new();
my $failed_q  = Thread::Queue->new();

#this is a subroutine, but one that runs 'as a thread'.
#when it starts, it inherits the program state 'as is'. E.g.
#the variable declarations above all apply - but changes to
#values within the program are 'thread local' unless the
#variable is defined as 'shared'.
#behind the scenes, Thread::Queue objects are 'shared' arrays.

sub worker {
    #NB - this will sit in a loop indefinitely, until you close the queue
    #using $process_q->end
    #we do this once we've queued all the things we want to process
    #and the sub completes and exits neatly.
    #however if you _don't_ end it, this will sit waiting forever.
    while ( defined( my $server = $process_q->dequeue() ) ) {
        chomp($server);
        print threads->self()->tid() . ": pinging $server\n";
        my $result = `/bin/ping -c 1 $server`;
        if ($?) { $failed_q->enqueue($server) }
        print $result;
    }
}

#insert tasks into thread queue.
open( my $input_fh, "<", "server_list" ) or die $!;
$process_q->enqueue(<$input_fh>);
close($input_fh);

#we 'end' process_q  - when we do, no more items may be inserted,
#and 'dequeue' returns undef when the queue is emptied.
#this means our worker threads (in their 'while' loop) will then exit.
$process_q->end();

#start some threads
for ( 1 .. $nthreads ) {
    threads->create( \&worker );
}

#Wait for threads to all finish processing.
foreach my $thr ( threads->list() ) {
    $thr->join();
}

#collate results. ('synchronise' operation)
while ( defined( my $server = $failed_q->dequeue_nb() ) ) {
    print "$server failed to ping\n";
}

This will start 5 threads and queue up some number of jobs, such that 5 are running in parallel at any given time, and then it 'unwinds' gracefully afterwards.
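
On the edit about hash updates getting lost: yes, threads::shared is the relevant tool if you stay with threads. Each thread otherwise works on its own copy of every variable, so updates made inside a thread never make it back to the parent. A minimal sketch of the idea (standalone, with a made-up %results hash rather than your actual data):

#!/usr/bin/perl

use strict;
use warnings;

use threads;
use threads::shared;

#declaring the hash ':shared' means every thread sees the same storage.
my %results : shared;

sub worker {
    my ($inst) = @_;

    #lock the hash while updating it, to avoid races between threads.
    #the lock is released when the enclosing block (the sub) exits.
    lock(%results);
    $results{$inst} = "processed $inst";
}

threads->create( \&worker, $_ ) for qw( a b c );
$_->join() for threads->list();

print "$_ => $results{$_}\n" for sort keys %results;

The same caveat as above applies though - this spawns a thread per key, so for a large number of keys you'd still want a fixed pool of workers fed by Thread::Queue, each writing into the shared hash.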

Forking

Parallel::ForkManager is the tool for the job here.

Unlike threads, forks are quite efficient on a Unix system, as the native fork() system call is well optimised.

What it's not so good at is passing data around, though - you've got to hand-roll any IPC between your forks in a way that you don't so much with threads.

A simple example of this would be:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $concurrent_fork_limit = 4;

my $fork_manager = Parallel::ForkManager->new($concurrent_fork_limit);

foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    my $pid = $fork_manager->start;
    if ($pid) {
        print "$$: Fork made a child with pid $pid\n";
        next;    #the parent carries on with the loop
    }
    #in the child, start() returns 0
    print "$$: child process started, with a key of $thing ($pid)\n";
    $fork_manager->finish;
}

$fork_manager->wait_all_children();

This does spawn off subprocesses, but cleans up after them fairly readily.
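
On the hash-updates point from your edit: forks lose updates in exactly the same way, since each child writes to its own copy of the parent's memory. Parallel::ForkManager does give you a ready-made channel back to the parent though - the child hands a reference to finish(), and the parent receives it in a run_on_finish callback. A sketch of that (with a made-up %results hash and a trivial 'result'):

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $fork_manager = Parallel::ForkManager->new(4);

my %results;

#run_on_finish fires in the parent each time a child exits.
#the final argument is whatever reference the child passed to finish().
$fork_manager->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $signal, $core_dump, $data ) = @_;
        $results{ $data->{thing} } = $data->{length} if defined $data;
    }
);

foreach my $thing ( "fork", "spoon", "knife", "plate" ) {
    $fork_manager->start and next;    #the parent continues the loop
    #in the child - do the work, then ship the result back to the parent.
    $fork_manager->finish( 0, { thing => $thing, length => length($thing) } );
}

$fork_manager->wait_all_children();
print "$_: $results{$_}\n" for sort keys %results;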

Nonblocking IO

Using IO::Select you would open some number of filehandles to subprocesses, and then use the can_read method to process the ones that are ready to read.

The perldoc for IO::Select covers most of the detail here; I'll reproduce the synopsis for convenience:

use IO::Select;

$select = IO::Select->new();

$select->add(\*STDIN);
$select->add($some_handle);

@ready = $select->can_read($timeout);

@ready = IO::Select->new(@handles)->can_read(0); 
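
($some_handle, $timeout and @handles in the synopsis are placeholders.) To tie it back to the subprocess idea above, here's a sketch of reading from several subprocesses as their output becomes ready - using a pipe open to ping as a stand-in for whatever command you're actually running:

#!/usr/bin/perl
use strict;
use warnings;
use IO::Select;

my $select = IO::Select->new();
my %host_of;    #map each filehandle back to its hostname

for my $host (qw( hosta hostb hostc )) {
    #'-|' opens a pipe from the subprocess's stdout.
    open( my $fh, '-|', "/bin/ping -c 1 $host" ) or die $!;
    $select->add($fh);
    $host_of{$fh} = $host;
}

#can_read blocks until at least one handle has data to read;
#it returns an empty list once every handle has been removed.
while ( my @ready = $select->can_read() ) {
    for my $fh (@ready) {
        if ( defined( my $line = <$fh> ) ) {
            print "$host_of{$fh}: $line";
        }
        else {
            #EOF - that subprocess has finished.
            $select->remove($fh);
            close($fh);
        }
    }
}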
Sobrique
  • I'd order the thread code differently: start worker threads, enqueue, end, dequeue, join. Less overhead that way, and you can even limit the size of the queues. – ikegami Aug 12 '21 at 22:06
  • Thanks for your reply! I actually followed your post here: https://www.perlmonks.org/?node_id=1068673 And I don't understand what actually happens in the threading... You looped 5 times and ran the ping on all servers. Was the rationale to ping all the servers in the process_q 5 times? When does the process_q "fill up" again? (Hope my question is clear) – urie Aug 15 '21 at 08:50
  • @urie the process queue 'fills up' via the `enqueue` method - provided you haven't ended the queue. In that example, you 'just' queue the contents of a file, then call `end` to let things unwind when they're done. But you can leave the queue open until you've no more input just fine. I've done this in workers where they're generating additional queue items themselves, for example. – Sobrique Aug 18 '21 at 08:36

You could use threads.

Here's an example that should take about 5 seconds to finish although it calls sleep(5) twice:

#!/usr/bin/perl

use strict;
use warnings;

use threads;

my %data = (
    'foo' => 'bar',
    'apa' => 'bepa',
);

sub somefuntion {
    my $key = shift;
    print "$key\n";
    sleep(5);
    return $data{$key};
}

my @threads;
for my $inst (sort keys %data) {
    push @threads, threads->create('somefuntion', $inst);
}

print "running...\n";

for my $thr (@threads) {
    print $thr->join() . "\n";
}

print "done\n";

This answer was made to show how threads work in Perl, because you mentioned threads. Just a word of caution:

The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that makes them easy to misuse. Few people know how to use them correctly or will be able to provide help.

The use of interpreter-based threads in perl is officially discouraged.

Ted Lyngmo
  • I would urge caution with threads - somewhat counterintuitively, they're _not_ lightweight, so with a large number of parallel items you'll end up incredibly resource-greedy and inefficient. Forks (if on Unix) are very lightweight, and you can do a 'worker threads' model, which is more efficient. – Sobrique Aug 12 '21 at 14:32
  • @Sobrique True - I added a word of caution to the answer. – Ted Lyngmo Aug 12 '21 at 15:21