Optimize searching the difference between two directories recursively in perl

Question

I am trying to find the differences between two directories in Perl. I want the code to run efficiently, and I am also not sure how to ignore certain files (say, those with the extension .txt or .o).

The code I have so far is:

use strict;
use warnings;
use Parallel::ForkManager;
use File::Find;
use List::MoreUtils qw(uniq);

my $dir1 = "/path/to/dir/first";
my $dir2 = "/path/to/dir/second";
my @comps = ('abc');
my (%files1, %files2);
my $workernum = 500; 
my $pm = new Parallel::ForkManager($workernum);
my @common = ();
my @differ = ();
my @only_in_first = ();
my @only_in_second = ();

foreach my $comp (@comps) {
    find( sub { -f  ($files1{$_} = $File::Find::name) }, "$dir1");
    find( sub { -f  ($files2{$_} = $File::Find::name) }, "$dir2");
    my @all = uniq(keys %files1, keys %files2);
    for my $file (@all) {
        my $pid = $pm->start and next; # do the fork
        my $result;
        if ($files1{$file} && $files2{$file}) { # file exists in both dirs
            $result = qx(/usr/bin/diff -q $files1{$file} $files2{$file});
            if ($result =~m/^Common subdirectories/) {
                push (@common, $result);
            } else {
                push (@differ, $result);
            }
        } elsif ($files1{$file}) { 
            push (@only_in_first, $file);
        } else {
            push (@only_in_second, $file);
        }
        $pm->finish; # do the exit in child process
    }
}

I have to guess your actual question. If you set variables in the forked processes, then these changes are not visible in the parent process. Please look at "RETRIEVING DATASTRUCTURES from child processes" in the Parallel::ForkManager manpage. — Slaven Rezic, Sep 18 '13 at 07:04
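
A minimal sketch of the mechanism that comment refers to, assuming the per-file work can be reduced to a small result hash. The item names, the worker count of 4, and the placeholder "comparison" are all made up for illustration; only new, start, finish, run_on_finish and wait_all_children come from Parallel::ForkManager.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical work items; in the question these would be the file names.
my @items = qw(a.c b.c c.c);

my $pm = Parallel::ForkManager->new(4);   # 4 workers is an arbitrary choice
my %status_of;                            # only ever filled in by the parent

# Runs in the parent each time a child exits. The last argument is the
# reference that the child passed to finish().
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $status_of{ $data->{item} } = $data->{status} if defined $data;
});

for my $item (@items) {
    $pm->start and next;                  # parent: move on to the next item
    # Child: placeholder for the real comparison.
    my $status = length($item) % 2 ? 'odd' : 'even';
    $pm->finish(0, { item => $item, status => $status });
}
$pm->wait_all_children;                   # collect results from all children

printf "%s => %s\n", $_, $status_of{$_} for sort keys %status_of;
```

In the question's code, the push calls run inside the children, so @common, @differ and the other arrays stay empty in the parent; handing the result to finish() and collecting it in run_on_finish avoids that.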

score 0 · Answer 1 · answered Sep 18 '13 at 09:04

The diff utility has a -r switch which makes it recurse into subdirectories.

Is that not sufficient for you?
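
For reference, a rough sketch of driving diff -r from Perl, assuming a diff that supports -q and -x (GNU diffutils does). The -x patterns cover the "ignore .txt or .o" part of the question. The name diff_dirs is made up for this example.

```perl
use strict;
use warnings;

# Run "diff -r -q" on two directories, skipping names that match @exclude.
# Returns diff's exit status followed by its output lines.
sub diff_dirs {
    my ($dir1, $dir2, @exclude) = @_;
    my @cmd = ('diff', '-r', '-q', (map { ('-x', $_) } @exclude), $dir1, $dir2);

    # List-form pipe open: no shell is involved, so odd characters in
    # the paths cannot be misinterpreted.
    open(my $fh, '-|', @cmd) or die "cannot run diff: $!";
    chomp(my @lines = <$fh>);
    close $fh;                                # sets $? to diff's wait status

    die "diff was killed by a signal\n" if $? & 127;
    my $exit = $? >> 8;                       # 0 = same, 1 = differences, 2 = trouble
    die "diff failed with exit status $exit\n" if $exit > 1;
    return ($exit, @lines);
}

# Usage: perl script.pl /path/to/dir/first /path/to/dir/second
if (@ARGV == 2) {
    my ($exit, @report) = diff_dirs(@ARGV, '*.txt', '*.o');
    print "$_\n" for @report;
    print $exit ? "directories differ\n" : "no differences found\n";
}
```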

user1126070

score 0 · Answer 2 · answered Sep 18 '13 at 20:22

Yes, diff -r does indeed do what your code does. However, diff -r doesn't do it with 500 worker processes. Then again, diff -r may well be fast enough that it doesn't need 500 processes running in parallel.

Things of note:

"$var" is rarely needed and is better written as $var.
Keeping two hashes for the comparison and still running uniq() over both lists of keys wastes memory and CPU cycles.
What diff -q does can easily be done in Perl itself, or at least sped up by first stat()'ing both files and comparing their sizes before forking. If the files are small, the whole comparison could be done in Perl.
If you really want to fork diff -q, at least check $?, since there might be problems with, e.g., the location of diff or its execution. In fact, checking the exit code is enough; there is no need to grep stdout/stderr.
For simplicity, use diff from the PATH, not an absolute path.
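
A sketch of the pure-Perl route described above, using only core modules. It is one possible shape, not a drop-in replacement: the names list_files and compare_dirs and the $ignore pattern are made up here, and files are keyed by their path relative to the directory root (an assumption on my part, so that same-named files in different subdirectories do not collide).

```perl
use strict;
use warnings;
use File::Find;
use File::Compare;
use File::Spec;

my $ignore = qr/\.(?:txt|o)\z/;   # extensions to skip, as asked in the question

# Map "path relative to $dir" => full path for every regular file under $dir.
sub list_files {
    my ($dir) = @_;
    my %found;
    find({ no_chdir => 1, wanted => sub {
        return unless -f $_;
        return if $_ =~ $ignore;
        $found{ File::Spec->abs2rel($_, $dir) } = $_;
    } }, $dir);
    return \%found;
}

# Returns four array refs: identical, differing, only in first, only in second.
sub compare_dirs {
    my ($dir1, $dir2) = @_;
    my ($f1, $f2) = (list_files($dir1), list_files($dir2));
    my (@same, @differ, @only1);
    for my $rel (sort keys %$f1) {
        if (!exists $f2->{$rel}) { push @only1, $rel; next; }
        my ($p1, $p2) = ($f1->{$rel}, $f2->{$rel});
        # Different sizes mean different contents, so only files of
        # equal size need to be read and compared byte by byte.
        if ((stat $p1)[7] != (stat $p2)[7]) { push @differ, $rel; next; }
        my $rc = compare($p1, $p2);           # 0 = equal, 1 = different, -1 = error
        die "cannot compare $rel: $!\n" if $rc < 0;
        push @{ $rc ? \@differ : \@same }, $rel;
    }
    my @only2 = sort grep { !exists $f1->{$_} } keys %$f2;
    return (\@same, \@differ, \@only1, \@only2);
}

# Usage: perl script.pl /path/to/dir/first /path/to/dir/second
if (@ARGV == 2) {
    my ($same, $differ, $only1, $only2) = compare_dirs(@ARGV);
    print "differ: $_\n"         for @$differ;
    print "only in first: $_\n"  for @$only1;
    print "only in second: $_\n" for @$only2;
}
```

No fork is needed here; whether this is fast enough depends on how many equal-sized files actually have to be read.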
