Currently, I have an application that uses XML::Twig to parse 20 XML files. Each file is about 0.5 GB, and the files are processed sequentially:

use XML::Twig;

foreach my $file (@files) {
    my $twig = XML::Twig->new(
        keep_encoding => 1,
        twig_handlers => {
            # inside a handler, $_ is the current element
            'section' => sub { $_->purge(); },
        },
    );
    $twig->parsefile($file);
}

Is there a way in Perl to run this code in parallel, and if so, how can I do it? My application runs on a Windows system.

smith
  • @smith what is the parsed XML used for? That will have the biggest impact on how you parallelize it. – David-SkyMesh Dec 30 '13 at 13:00
  • It is used to parse every section and to apply the corresponding update to the DB for each section. – smith Dec 30 '13 at 13:02
  • With files that big, what @JonFast has said will work, but you'd get much better performance using a SAX or pull-type parser [1 fork per core], then having an additional pool of forks that just deal with database insertions, fed by some sort of queue [FIFOs, Thread::Queue or ZeroMQ] – as many forks as your database will efficiently allow to update the table at once. A sketch of this queue-based approach follows these comments. – David-SkyMesh Dec 31 '13 at 02:34
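A minimal sketch of the queue-based pipeline described in the last comment, using interpreter threads and Thread::Queue (on Windows, fork is emulated with these threads anyway). The writer count, the per-section payload, and the DB call are assumptions for illustration, not code from the question:

use strict;
use warnings;
use threads;
use Thread::Queue;
use XML::Twig;

my @files = @ARGV;            # assumed: the 20 file paths on the command line
my $queue = Thread::Queue->new();

# Writer pool: as many threads as the database tolerates (assumed: 2).
my @writers = map {
    threads->create(sub {
        # Each writer would open its own DB handle here (hypothetical).
        while (defined(my $record = $queue->dequeue())) {
            # $dbh->do('UPDATE ...', undef, $record);   # hypothetical DB update
        }
    });
} 1 .. 2;

# Parser pool: one thread per file (cap this at your core count in practice).
my @parsers = map {
    my $file = $_;
    threads->create(sub {
        XML::Twig->new(
            keep_encoding => 1,
            twig_handlers => {
                'section' => sub {
                    my ($twig, $section) = @_;
                    $queue->enqueue($section->text);   # hand the section to the writers
                    $twig->purge();
                },
            },
        )->parsefile($file);
    });
} @files;

$_->join for @parsers;                 # wait until every file is parsed
$queue->enqueue(undef) for @writers;   # one sentinel per writer to stop it
$_->join for @writers;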

1 Answer


You should use Parallel::ForkManager from CPAN. It lets you fork a worker for each file and parse the files in parallel. Also, be aware that Perl 5 has threads, but the performance gain will probably not be significant.

The example code from the module's documentation should do what you want; I've posted it here for your convenience. All it really does is create a manager object that caps the number of concurrent processes; then, for each piece of data (or file), start() forks and returns the child's pid in the parent (which skips to the next item), the child does the work, and finish() terminates the child:

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

foreach my $data (@all_data) {
  # Forks; in the parent, start() returns the child's pid and `next` skips ahead:
  my $pid = $pm->start and next;

  # ... do some work with $data in the child process ...

  $pm->finish;  # terminates the child process
}
$pm->wait_all_children;  # block until every child has exited
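Applied to your loop, a minimal sketch might look like this (the worker count of 4 is an assumption; tune it to your core count and to how many concurrent writers your database tolerates):

use strict;
use warnings;
use XML::Twig;
use Parallel::ForkManager;

my @files = @ARGV;                           # assumed: the 20 file paths
my $pm    = Parallel::ForkManager->new(4);   # assumed: 4 parallel workers

foreach my $file (@files) {
    $pm->start and next;                # parent: schedule the next file
    XML::Twig->new(
        keep_encoding => 1,
        twig_handlers => {
            'section' => sub { $_->purge(); },
        },
    )->parsefile($file);
    $pm->finish;                        # child exits here
}
$pm->wait_all_children;                 # block until every file is done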

Be aware that Windows has no native fork(): Perl emulates fork() there with interpreter threads, so Parallel::ForkManager's "children" on Windows are really threads inside a single process. It should still perform the task adequately. If you want real OS processes on Windows, you can create them through the Win32 CreateProcess() API (for example via the Win32::Process module). There's also the option of the Forks::Super package for multiprocessing, which works on Windows as well.
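For example, a minimal Forks::Super version might look like this (the MAX_PROC value and the ON_BUSY queueing policy are assumptions; check the module's documentation for the full option list):

use strict;
use warnings;
use XML::Twig;
use Forks::Super MAX_PROC => 4, ON_BUSY => 'queue';   # assumed limits

my @files = @ARGV;   # assumed: the 20 file paths

foreach my $file (@files) {
    # Forks::Super's fork() accepts a code ref to run in the child:
    fork {
        sub => sub {
            XML::Twig->new(
                keep_encoding => 1,
                twig_handlers => { 'section' => sub { $_->purge(); } },
            )->parsefile($file);
        },
    };
}
waitall();   # wait for all background jobs to finish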

Jon Fast