
So, basically, I want to read a file into a hash, but since the file is huge and doesn't fit into RAM, I split it into chunks, process the data (the search_f2 sub), and then read the next chunk. This seems to work, but of course it only uses one core. Is there an easy way to fork the search_f2 sub? I've tried a naive approach with Parallel::ForkManager, but it doesn't work as far as I can see. Any hint on how to achieve that? I don't actually need to return anything from the forked sub; it would be sufficient if it printed the result to STDOUT. The structure of file1 is the following (basically the output of a tar -tf command):

tarfile1.tar
gzip1.gz
<skip>
gzipX.gz
<skip>
tarfileX.tar
<some random number of gz files>

file2 is just a plain, newline-separated list of gzipX.gz files.

The Perl code:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use Data::Dumper;
use Parallel::ForkManager;

my $file1 = $ARGV[0] // die "Need file1 (tar listing) as argument";
my $file2 = $ARGV[1] // die "Need file2 (gz list) as argument";

my $fd1 = read_f($file1);
my %hdata;
my $tarfile;
my $index = 0;

my $pm = Parallel::ForkManager->new(10);
while (my $line = <$fd1>) {
    chomp $line;
    if ( $line =~ m/^somepattern.*tar$/ ){
        $tarfile = $line;
        $index++;
    }
    if ($index >= 100) {   # count tar files only, not every line
        say "\tForking a process";
        my $pid = $pm->start and do {
            $index = 0;
            %hdata = ();
            next;
        };
        search_f2(\%hdata);
        $pm->finish;
    }
    push @{$hdata{$tarfile}},$line if $line =~ m/.*\.gz$/;
}
close $fd1;

#last search
search_f2(\%hdata);

sub search_f2{
    my ($h) = @_;
    my %hdata = %$h;
    my $fd2 = read_f($file2);
    while (my $ciffile  = <$fd2>) {       
        chomp $ciffile;                   
        foreach my $tarfile (keys %hdata) {  
            my $values = $hdata{$tarfile};   
            if (grep (/$ciffile/, @$values)) {
                say "$tarfile";
                delete $hdata{$tarfile};
                last;
            }
        }
    }
    close $fd2;
    return;
}

sub read_f {
    my $file = shift;
    die "No such file: $file\n" if ! -e $file;
    # pigz -f happily passes plain (uncompressed) files through as well;
    # list-form pipe open avoids shell interpolation of the filename
    open my $fh, '-|', 'pigz', '-fdc', $file
        or die "Can't open pipe from pigz for $file: $!\n";
    return $fh;
}
mestia
  • First, this is how I read it: read a number of lines (100) and fork a process that runs a sub to process that batch of lines, keeping the pool at 10. Is that correct? If so, then yes, why not, that'd be a good way to go, if that processing takes a little while so the overhead won't kill the benefit. – zdim Oct 12 '22 at 19:15
  • Exactly, read up to 100 tar files and build the hash tarfile -> list of gz files, then fork and take the next chunk. However, I do not see the forked processes for whatever reason. I have a feeling that I messed up the Parallel::ForkManager syntax. – mestia Oct 12 '22 at 19:38

1 Answer


I take the question to be the following: read a certain number of lines from a file and process each such chunk of text in its own fork. I'm not quite sure of some details in the question, so here is a basic demo, hopefully to serve as a template.

This keeps the number of processes at 3 and processes batches ("chunks") of 2 lines each.

use warnings;
use strict;
use feature qw(say state);
use Parallel::ForkManager;

my $file = shift // die "Usage: $0 filename\n";

my $pm = Parallel::ForkManager->new(3);

open my $fh, '<', $file or die $!; 

my ($chunk, $num_lines) = ('', 0);

while (my $line = <$fh>) {
    chomp $line;
    say "Processing line: |$line|";

    $chunk .= $line;

    if (++$num_lines >= 2) {
        say "\tForking a process";

        $pm->start and do {
            $num_lines = 0;
            $chunk = ''; 
            next;
        };
        proc_chunk($chunk);
        $pm->finish;
    }   
}
$pm->wait_all_children;

sub proc_chunk {
    my ($chunk) = @_; 
    my $line_nos = join ' ', $chunk =~ /#([0-9]+)/g; 
    say "\t\tin a fork, processing chunk with lines: $line_nos";
    sleep 10; 
    say "\t\t\t... done with fork";
}

In P::FM, the idiom $pm->start and next; forks a process, and the parent immediately jumps to the next iteration of the loop. So any needed resetting of variables needs to be done right there, and I use a do { ... } block for it.
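The parent/child split of that idiom can be seen in isolation. In this minimal sketch (the variable @launched is mine, for illustration), the do-block runs only in the parent, because there start() returns the child's PID (true); the code after it runs only in the child, where start() returns 0:

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(2);
my @launched;   # parent-side bookkeeping

for my $item (1 .. 3) {
    # Parent: start() returns the child's PID (true), so the
    # do-block runs and `next` moves on to the next iteration.
    $pm->start and do {
        push @launched, $item;
        next;
    };
    # Only the child reaches this point (start() returned 0).
    # ... per-chunk work would go here ...
    $pm->finish;
}
$pm->wait_all_children;

print "parent launched: @launched\n";   # parent launched: 1 2 3
```

Note that the push happens before start(), in the process that is about to fork, so it is recorded exactly once per iteration, by the parent.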

The sub sleeps so that we can see the group of forks exiting practically together, which happens because the processing here is so quick. (In fact, P::FM forks a new process as soon as one finishes, to keep the given number running; it doesn't wait for the whole batch to finish first, unless wait_for_available_procs is set up, see an example and a lot more detail here.)

This prints

Processing line: |This is line #1|
Processing line: |This is line #2|
        Forking a process
Processing line: |This is line #3|
Processing line: |This is line #4|
        Forking a process
Processing line: |This is line #5|
Processing line: |This is line #6|
        Forking a process
Processing line: |This is line #7|
Processing line: |This is line #8|
        Forking a process
                in a fork, processing chunk with lines: 1 2
                in a fork, processing chunk with lines: 3 4
                in a fork, processing chunk with lines: 5 6
                        ... done with fork
                        ... done with fork
                        ... done with fork
Processing line: |This is line #9|
Processing line: |This is line #10|
        Forking a process
Processing line: |This is line #11|
Processing line: |This is line #12|
        Forking a process
Processing line: |This is line #13|
Processing line: |This is line #14|
        Forking a process
                in a fork, processing chunk with lines: 7 8
                in a fork, processing chunk with lines: 9 10
                in a fork, processing chunk with lines: 11 12
                        ... done with fork
                        ... done with fork
                        ... done with fork
Processing line: |This is line #15|
Processing line: |This is line #16|
        Forking a process
Processing line: |This is line #17|
Processing line: |This is line #18|
        Forking a process
Processing line: |This is line #19|
                in a fork, processing chunk with lines: 13 14
Processing line: |This is line #20|
        Forking a process
                in a fork, processing chunk with lines: 15 16
                in a fork, processing chunk with lines: 17 18
^C
[...etc...]

The original version of the question didn't have that do { } block but had ... and next, so the parent bailed out of the iteration directly, without resetting anything. (I see that the question has since been edited to include it.)
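To see why the reset matters, here is a contrived sketch of that failure mode (the loop stands in for the file read, and @forked_at is mine): with a bare `start and next` the parent's $num_lines counter is never reset, so after the first full chunk the condition stays true and every subsequent line triggers a fork.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(2);
my ($chunk, $num_lines) = ('', 0);
my @forked_at;   # which line numbers triggered a fork (recorded pre-fork)

for my $lineno (1 .. 6) {
    $chunk .= "line $lineno\n";
    if (++$num_lines >= 2) {
        push @forked_at, $lineno;
        $pm->start and next;   # parent skips ahead WITHOUT resetting
        $pm->finish;           # child does nothing and exits
    }
}
$pm->wait_all_children;

# $num_lines stays >= 2 after the first chunk, so lines 3..6 each
# fork individually instead of only every second line forking.
print "@forked_at\n";   # 2 3 4 5 6
```

With the do { } block resetting $num_lines and $chunk, only lines 2, 4, and 6 would fork, one per two-line chunk.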

zdim