So, basically I want to read a file into a hash, but since the file is huge and doesn't fit into RAM, I split it into chunks, process the data (the search_f2 sub) and then read the next chunk. This seems to work, but of course it only uses one core.
Is there an easy way to fork the search_f2 sub?
I've tried a naive way with Parallel::ForkManager, but as far as I can see it doesn't work.
Any hint on how to achieve that? I don't actually need to return anything from the forked sub; it would be sufficient if it printed the result to STDOUT.
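For reference, the bare pattern I'm trying to follow is roughly the one from the Parallel::ForkManager synopsis (my simplified sketch, the loop body is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(10);    # at most 10 children at a time

for my $chunk (1 .. 5) {
    $pm->start and next;                    # parent: fork a child, move to the next chunk
    # child: do the work and print to STDOUT, nothing has to be returned
    print "chunk $chunk handled by PID $$\n";
    $pm->finish;                            # child exits here
}
$pm->wait_all_children;                     # parent waits for all children

What I can't figure out is how to apply this cleanly to the chunking in my code below.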
The structure of file1 is the following (basically the output of a tar -tf command):
tarfile1.tar
gzip1.gz
<skip>
gzipX.gz
<skip>
tarfileX.tar
<some random number of gz files>
file2 is just a plain, line-break-separated list of gzipX.gz files.
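Something like this (made-up names, just to show the shape):
gzip1.gz
gzip7.gz
gzip42.gz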
The Perl code:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use Data::Dumper;
use Parallel::ForkManager;
my $file1 = $ARGV[0] // die "Need a file as argument";
my $file2 = $ARGV[1] // die "Need a file as argument";
my $fd1 = read_f($file1);
my %hdata;
my $tarfile;
my $index = 0;
my $pm = Parallel::ForkManager->new(10);
while (my $line = <$fd1>) {
    chomp $line;
    # a new tar file name starts a new group of gz entries
    if ( $line =~ m/^somepattern.*tar$/ ) {
        $tarfile = $line;
        $index++;
    }
    # after 100 tar files, hand the collected chunk to a child process
    if ( $index >= 100 ) {
        say "\tForking a process";
        my $pid = $pm->start and do {
            # parent: reset the counter and the chunk, keep reading
            $index = 0;
            %hdata = ();
            next;
        };
        # child: search this chunk and exit
        search_f2(\%hdata);
        $pm->finish;
    }
    push @{ $hdata{$tarfile} }, $line if $line =~ m/\.gz$/;
}
close $fd1;
# search the last (incomplete) chunk
search_f2(\%hdata);
sub search_f2 {
    my ($h) = @_;
    my %hdata = %$h;
    my $fd2 = read_f($file2);
    while (my $ciffile = <$fd2>) {
        chomp $ciffile;
        # print the tar file that contains this gz file, then stop checking it
        foreach my $tarfile (keys %hdata) {
            my $values = $hdata{$tarfile};
            if ( grep { /$ciffile/ } @$values ) {
                say "$tarfile";
                delete $hdata{$tarfile};
                last;
            }
        }
    }
    close $fd2;
    return;
}
sub read_f {
    my $file = shift;
    die "No such file: $file\n" if ! -e $file;
    # pigz -f happily passes plain (uncompressed) files through as well
    open my $fh, '-|', 'pigz', '-fdc', $file
        or die "Can't open pipe from pigz for $file: $!\n";
    return $fh;
}
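For what it's worth, this is the simplified shape I tried to adapt before wiring it into the code above (a hypothetical sketch, not my real code: it reads plain lines from STDIN and process_chunk() just stands in for search_f2):

#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(10);
my @chunk;

while (my $line = <STDIN>) {
    chomp $line;
    push @chunk, $line;
    next if @chunk < 100;             # keep collecting until the chunk is full

    if ($pm->start) {                 # parent: child is forked, reset and read on
        @chunk = ();
        next;
    }
    process_chunk(\@chunk);           # child: print results to STDOUT
    $pm->finish;                      # child exits here
}
process_chunk(\@chunk) if @chunk;     # leftover lines, handled in the parent
$pm->wait_all_children;

sub process_chunk {
    my ($lines) = @_;
    say scalar(@$lines), " lines handled by PID $$";
}

In my real code the chunk is %hdata (keyed by tar file) instead of a flat array, and the child is supposed to run search_f2 on it, but that's where it stops working for me.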