1

I need to access, in the main program, the contents of hashes that were populated in forked subroutines. Here specifically is what I am trying to do:

use Benchmark;
use File::Find;
use File::Basename;
use File::Path;
use Data::Dumper;
use strict;
use warnings;

print "Process ID: $$ \n";
my @PartitionRoots = qw(/nfs/dir1 /nfs/dir2 /nfs/dir3 /nfs/dir4);
my @PatternsToCheck = qw(prefix1 prefix2);
my @MatchedDirnames = qw();
my $DirCount = 0;
my $Forks = 0;
my @AllDirs = qw();
my %SweepStats = ();

foreach my $RootPath (@PartitionRoots) {
    foreach my $Pattern (@PatternsToCheck) {
        if (grep {-e} glob ("$RootPath/$Pattern*")) {
            my @Files = glob ("$RootPath/$Pattern*");
            foreach my $FileName (@Files) {
                if (-d $FileName) {
                    $DirCount++;
                    push (@AllDirs, $FileName);
                    my $PID = fork;
                    if (not defined $PID) {
                        warn "Could not fork!\n";
                        next;
                    }
                    if ($PID) {
                        $Forks++;
                        print "In the parent PID ($$), Child pid: $PID Number of forked child processes: $Forks\n";
                    } else {
                        print "In the child PID ($$)\n";
                        find(\&file_stats, $FileName);
                        print "Child ($$) exiting...\n";
                        exit;
                    }
                }
            }
        }
    }
}
for (1 .. $Forks) {
    my $PID = wait();
    print "Parent saw child $PID exit.\n";
}
print "Parent ($$) ending.\n";

print Dumper (\%SweepStats);
foreach my $DirName (@AllDirs) {
    print ("Printing $DirName contents...\n");
    foreach (@{$SweepStats{$DirName}}) {
        my $uname = $_->{uname};
        my $mtime = $_->{mtime};
        my $size = $_->{size};
        my $file = $_->{file};
        print ("$uname $mtime $size $file\n");
    }
}

sub file_stats {
    if (-f $File::Find::name) {
        my $FileName = $_;
        my $PathName = dirname($_);
        my $DirName = basename($_);     
        my $uid = (stat($_))[4];
        my $uname = getpwuid($uid);
        my $size = (stat($_))[7];
        my $mtime = (stat($_))[9];
        if (defined $uname && $uname ne '') {
            push @{$SweepStats{$FileName}}, {path=>$PathName,dir=>$DirName,uname=>$uname,mtime=>$mtime,size=>$size,file=>$File::Find::name};
        } else {
            push @{$SweepStats{$FileName}}, {path=>$PathName,dir=>$DirName,uname=>$uid,mtime=>$mtime,size=>$size,file=>$File::Find::name};
        }
    }
    return;
}

exit;

...but Dumper is coming up empty, so the dereferencing and printing that immediately follows is empty, too. I know the file stat collecting is working, because if I replace the "push @{$SweepStats{$FileName}}" statements with print statements, I see exactly what is expected. I just need to properly access the hashes from the global level, but I cannot get it quite right. What am I doing wrong here? There are all kinds of posts about passing hashes to subroutines, but not the other way around.

Thanks!

Chris
    If you want to limit the number of simultaneous children, you could use Parallel::ForkManager. It even provides a means of returning data to the parent. – ikegami Jun 05 '20 at 20:00

2 Answers

3

The fork call creates a new, independent process. That child process and its parent cannot write to each other's data. So in order for data to be exchanged between the parent and the child we need to use some Inter-Process-Communication (IPC) mechanism.
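
For illustration only, here is a minimal hand-rolled sketch of one such IPC mechanism: a pipe per child, with Storable used to serialize each child's hash back to the parent. The directory list and the per-child "work" are placeholders, not your real code:

use strict;
use warnings;
use Storable qw(freeze thaw);

my %merged;       # parent-side results
my @readers;      # one read handle per child

for my $dir (qw(dir1 dir2)) {                 # stand-in directory list
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork // die "fork failed: $!";
    if ($pid) {                               # parent keeps the read end
        close $writer;
        push @readers, $reader;
    } else {                                  # child writes its results and exits
        close $reader;
        my %stats = ( $dir => "collected in child $$" );   # placeholder work
        print {$writer} freeze(\%stats);
        close $writer;
        exit 0;
    }
}

for my $reader (@readers) {
    local $/;                                 # slurp the whole serialized blob
    my $blob  = <$reader>;
    my $stats = thaw($blob);
    %merged = (%merged, %$stats);             # fold child results into the parent
}
wait() for @readers;                          # reap the children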

It is by far easiest to use a library that takes care of the details, and Parallel::ForkManager seems well suited here: it provides an easy way to pass data from a child back to the parent, and it has a simple queue that caps the number of simultaneous processes.

Here is some working code; comments follow.

use warnings;
use strict;
use feature 'say';

use File::Find;
use File::Spec;
use Parallel::ForkManager;
    
my %file_stats;  # written from callback in run_on_finish()
 
my $pm = Parallel::ForkManager->new(16);

$pm->run_on_finish(
    sub {  # 6th argument is what is passed back from finish()
        my ($pid, $exit, $ident, $signal, $core, $dataref) = @_;
        foreach my $file_name (keys %$dataref) {
            $file_stats{$file_name} = $dataref->{$file_name};
        }
    }   
);

my @PartitionRoots  = '.';  # For my tests: current directory,
my @PatternsToCheck = '';   # no filter (pattern is empty string)

my %stats;  # for use by File::Find in child processes

foreach my $RootPath (@PartitionRoots) {
    foreach my $Pattern (@PatternsToCheck) {
        my @dirs = grep { -d } glob "$RootPath/$Pattern*";
        foreach my $dir (@dirs) { 
            #say "Looking inside $dir";       
            $pm->start and next;          # child process
            find(\&get_file_stats, $dir);
            $pm->finish(0, { %stats });   # exits, {%stats} passed back
        }
    }   
}
$pm->wait_all_children;
    
sub get_file_stats {
    return if not -f;
    #say "\t$File::Find::name";

    my ($uid, $size, $mtime) = (stat)[4,7,9];
    my $uname = getpwuid $uid;

    push @{$stats{$File::Find::name}}, {
        path => $File::Find::dir,
        dir  => ( File::Spec->splitdir($File::Find::dir) )[-1],
        uname => (defined $uname and $uname ne '') ? $uname : $uid,
        mtime => $mtime,
        size => $size,
        file => $File::Find::name
    };
}

Comments

  • The main question in all this is: at which level of your three-level hierarchy to spawn child processes? I left it as in the question, where a child is forked for each directory. This may be suitable if (most of) the directories contain many files; but if that isn't so and each child has little work to do, the forking overhead may reduce or cancel the speedup (a sketch of forking one level higher follows this list)

  • The %stats hash, which the File::Find callback fills, needs to be declared outside of all loops so that it is visible in the sub. It is therefore inherited by every child process, but each child gets its own copy, so we need not worry about the children's data overlapping

  • I simplified (and corrected) the code beyond the forking as well, following what seemed to be intended. Please let me know if that is off

  • See the linked documentation, and for example this post and the links in it, for details
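
For comparison, here is a sketch of moving the fork up one level, so that each partition root gets one child rather than each directory. It reuses the variables and the callback from the code above:

# Sketch only: one child per partition root instead of one per directory.
# Fewer, longer-lived children; less forking overhead when directories are small.
foreach my $RootPath (@PartitionRoots) {
    $pm->start and next;                    # child handles the whole root
    foreach my $Pattern (@PatternsToCheck) {
        my @dirs = grep { -d } glob "$RootPath/$Pattern*";
        find(\&get_file_stats, $_) for @dirs;
    }
    $pm->finish(0, { %stats });             # pass back everything this child found
}
$pm->wait_all_children;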


To display complex data structures, use a library; there are many.

I use Data::Dump, which is intended simply to print nicely,

use Data::Dump qw(dd pp);
...
dd \%file_stats;  # or: say "Stats for all files: ", pp \%file_stats;

for its compact output, while the most widely used is the core Data::Dumper

use Data::Dumper;
...
say Dumper \%file_stats;

which also "understands" data structures (so you can mostly eval them back).

(Note: In this case there'll likely be a lot of output! So redirect it to a file, or exit those loops after the first iteration, just to see how it's all going.)


When a process is forked, the parent's variables and data are available to the child. For efficiency they aren't copied right away, so initially the child does read the parent's data. But any data generated after the fork call, in either the parent or the child, cannot be seen by the other process.
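
A tiny self-contained demonstration of that last point (separate from the code above):

use strict;
use warnings;

my %h = (seen_before_fork => 1);

my $pid = fork // die "fork failed: $!";
if ($pid == 0) {                              # child
    print "child sees: @{[ keys %h ]}\n";     # pre-fork data is inherited
    $h{added_in_child} = 1;                   # changes only the child's copy
    exit 0;
}
wait();
print "parent sees: @{[ sort keys %h ]}\n";   # no 'added_in_child' here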

zdim
  • @Chris I can imagine that this may be a bit much to take in all at once; please let me know how it goes – zdim Jun 06 '20 at 10:36
  • Thanks @zdim! It took some effort to get Parallel::ForkManager fully installed with all of its dependencies; it is a door made up of doors. But it is working, as is your code. My only follow-up questions are: 1) can I change the ForkManager->new(16) to an integer variable that represents the number (within reason, of course) of directories collected? and 2) how do I dereference and print the contents of %stats after $pm->wait_all_children completes? – Chris Jun 06 '20 at 14:02
  • @Chris Great that things worked out :). (1) The argument to `new` is how many processes (at most) it is going to keep running. So once it gets to 16 it won't create any more, but will sit there (at `$pm->start`) and wait until a process exits, and _then_ it creates another. As for the "number of directories collected": you can first figure out how many there are and then create a `P::FM` object (with `new`), or count directories as you go and condition the creation of new processes on that number. (Follow a link in my post, linked in this answer, for an example; a sketch of the first option follows these comments.) – zdim Jun 06 '20 at 20:58
  • @Chris (2) There are a number of modules that can print complex data structures. Some are designed to reproduce valid Perl data structures later, some only to display nicely. Added to the post. I mention two, but there are plenty of others, to suit anyone's fancy. – zdim Jun 06 '20 at 21:16
  • @Chris "_represents the number (...) of directories collected?_" --- so do you mean that if you have, say, 36 directories you use 12 processes, or some such? I'd say that you don't need that. If you can spare 4 cores for this job then use `new(4)`, or perhaps `new(8)` or even 16 as this is light-ish on CPU, if you want every cycle. (Time it.) If there are _many_ directories then a queue is precisely what you want, and the exact number of processes is less important (up to what your hardware can do nicely). – zdim Jun 06 '20 at 21:42
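
A small sketch of the first option above: collect the directories first, then size the Parallel::ForkManager pool to the smaller of the directory count and a hard cap (16 is arbitrary here):

my @all_dirs;
foreach my $RootPath (@PartitionRoots) {
    foreach my $Pattern (@PatternsToCheck) {
        push @all_dirs, grep { -d } glob "$RootPath/$Pattern*";
    }
}
my $cap = @all_dirs < 16 ? scalar @all_dirs : 16;   # never more workers than dirs
my $pm  = Parallel::ForkManager->new($cap);
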
0

Try this module: IPC::Shareable. It is recommended by perldoc perlipc, and you can find an answer to your question here.

k-mx
  • Thanks, @k-mx I tried that IPC::Shareable module and got it so close to working. But I ran into the dreaded "Length of shared data exceeds shared segment size at" message, which I cannot seem to work around. In the end, the amount of data collected is 50+ MB, so I clearly need to rethink this whole approach... – Chris Jun 06 '20 at 12:47
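
For anyone trying that route, a hedged sketch of the tie call, under the assumption that the module's documented options (key, create, destroy, size) apply; the size option (in bytes) is what governs the "exceeds shared segment size" error, though the OS also caps SysV segment sizes, and a 50+ MB segment may not be a good fit for this approach anyway:

use strict;
use warnings;
use IPC::Shareable;

# Assumption: option names as documented by IPC::Shareable; the segment
# size here is an illustrative value, not a recommendation.
tie my %SweepStats, 'IPC::Shareable', {
    key     => 'swep',            # hypothetical identifier for the segment
    create  => 1,                 # create the segment if it does not exist
    destroy => 1,                 # remove it when this process exits
    size    => 64 * 1024 * 1024,  # 64 MB; the default is much smaller
};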