
I have a problem with the following code under the latest release of Strawberry Perl for Windows: I want to read in all text files in a directory and process their contents. I don't currently see a way to process them line by line, because some of the changes I want to make to the file contents span newlines. The processing largely involves removing large chunks of the files (in my example code below it is just one substitution, but ideally I would run a couple of similar regexes that each cut material out of the file).

I am running this script on a large number of files (>10,000) and it always breaks down with an "Out of memory!" message on one particular file that is larger than 400 MB. The thing is that when I write a program that ONLY processes the ONE file, the code works fine.

The machine has 8 GB RAM, so I would think that physical RAM is not the issue.

I read through other posts on memory issues, but did not find anything that would help me achieve my goal.

Can anyone suggest what I would need to change to make the program work, i.e., make it more memory efficient or somehow sidestep the issue?

use strict;
use warnings;
use Path::Iterator::Rule;
use utf8;

use open ':std', ':encoding(utf-8)';

my $doc_rule = Path::Iterator::Rule->new;
$doc_rule->name('*.txt'); # only process text files
$doc_rule->max_depth(3); # don't recurse deeper than 3 levels
my $doc_it = $doc_rule->iter("C:\\Temp\\");
while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    # read in file
    open (FH, "<", $file) or die "Can't open $file for read: $!";
    my @lines;
    while (<FH>) { push (@lines, $_) }; # slurp entire file
    close FH or die "Cannot close $file: $!";

    my $lines = join("", @lines); # put entire file into one string

    $lines =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs; #perform the processing

    # write out file
    open (FH, ">", $file) or die "Can't open $file for write: $!";
    print FH $lines; # dump entire file
    close FH or die "Cannot close $file: $!";
}
user1769925
    Not sure if you came across this but this is a great answer for heap memory profiling and using memory map the file rather than slurping the file. http://stackoverflow.com/questions/9733146/tips-for-keeping-perl-memory-usage-low – cmidi Feb 04 '15 at 19:33
    Instead of doing a multi-line search and replace, why not just: 1) read the file line-by-line until you get to your opening delimiter 2) check that subsequent lines match your desired condition; if they do, don't print anything to your output file until you reach the closing delimiter; if they don't, print them. – ThisSuitIsBlackNot Feb 04 '15 at 19:49
    You could try using something like [File::Sip](http://search.cpan.org/~sukria/File-Sip-0.003/lib/File/Sip.pm) – elcaro Feb 04 '15 at 21:38
    It looks like you are processing XML data, and you should be using [`XML::Twig`](https://metacpan.org/pod/XML::Twig) – Borodin Feb 04 '15 at 21:42
  • It's not a solution (which is why I'm posting this as a comment), but, for what it's worth, you can read the file line-by-line in one shot as `my @lines = <FH>;` – Max Lybbert Feb 04 '15 at 22:23
  • Thanks for all the comments; no, I am not reading XML. – user1769925 Feb 06 '15 at 14:20
  • And I am still wondering why Perl does not just free up the memory that it is not using anymore; but I am probably not willing to dig deep enough in the Perl internals to truly understand it. – user1769925 Feb 06 '15 at 14:21
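
For reference, here is a minimal sketch of the memory-mapping approach cmidi's comment mentions, using File::Map. It is untested and makes assumptions beyond the question: the path is made up, the data is assumed to match the OP's pattern, and Windows CRLF line endings are allowed for via `\r?\n`. The mapped scalar is backed by the file rather than the Perl heap, so only the spans that are written out get copied into memory.

use strict;
use warnings;
use File::Map qw(map_file unmap);

my $file = "C:/Temp/example.txt";   # hypothetical path for illustration

map_file my $contents, $file, '<';  # file-backed scalar, not loaded onto the heap

# write the spans we keep to a temporary file, then swap it into place
open my $outfh, '>:raw', "$file.tmp" or die "Can't open $file.tmp for write: $!";

my $pos = 0;
while ( $contents =~ m{<DOCUMENT>\r?\n<TYPE>EX-.*?\r?\n</DOCUMENT>}gs ) {
    print {$outfh} substr($contents, $pos, $-[0] - $pos);  # text before this match
    $pos = $+[0];                                          # skip the matched block
}
print {$outfh} substr($contents, $pos);                    # tail after the last match
close $outfh or die "Cannot close $file.tmp: $!";

unmap $contents;                    # release the mapping before replacing the file
unlink $file or die "Cannot unlink $file: $!";
rename "$file.tmp", $file or die "Cannot rename $file.tmp: $!";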

4 Answers


Handle the file line by line:

while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    open (my $infh, "<", $file) or die "Can't open $file for read: $!";
    open (my $outfh, ">", $file . ".tmp") or die "Can't open $file.tmp for write: $!";

    while (<$infh>) {
       if ( /<DOCUMENT>/ ) {
           # append the next line to test for TYPE
           $_ .= <$infh>;
           if (/<TYPE>EX-/) {
              # document type is excluded, now loop through 
              # $infh until the closing tag is found.
              while (<$infh>) { last if m|</DOCUMENT>|; }

              # jump back to the <$infh> loop to resume
              # processing on the next line after </DOCUMENT>
              next;
           }
           # if we've made it this far, the document was not excluded
           # fall through to print both lines
       }
       print $outfh $_;
    }

    close $outfh or die "Cannot close $file: $!";
    close $infh or die "Cannot close $file: $!";
    unlink $file;
    rename $file.'.tmp', $file; 
}
Ben Grimm
    The solution I hinted of at the end of my answer but was too lazy to write. +1 – tjd Feb 04 '15 at 20:12
  • Indeed it is. A bit shorter now :) – Ben Grimm Feb 04 '15 at 20:29
    Have you tested this? You have a superfluous `next`, but more importantly it is far from clear what is in `$_` after your inner `while` loop. It looks like you will be printing the closing `</DOCUMENT>` tag whereas the OP's code deletes it. – Borodin Feb 04 '15 at 21:46
    Yes, it's been tested. The `next` jumps back to the top of the `while`, avoiding the print after a successful TYPE match. But, if you couldn't see that then perhaps some comments are in order. – Ben Grimm Feb 04 '15 at 21:48

You keep two complete copies of the file in memory at the same time, @lines and $lines. You might consider instead:

open (my $FH, "<", $file) or die "Can't open $file for read: $!";
$FH->input_record_separator(undef); # slurp entire file
my $lines = <$FH>;
close $FH or die "Cannot close $file: $!";

On sufficiently obsolete versions of Perl you may need to explicitly use IO::Handle.
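
For reference, the same slurp can also be spelled with a localised `$/` (the input record separator), which works on any Perl without touching IO::Handle:

open (my $FH, "<", $file) or die "Can't open $file for read: $!";
my $lines = do { local $/; <$FH> };   # undef $/ means no record separator: read the whole file at once
close $FH or die "Cannot close $file: $!";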

Note also: I've switched to lexical file handles from the bare word versions. I presume you aren't striving for compatibility with Perl v4.

Of course if cutting your memory requirements by half isn't enough, you could always iterate through the file...
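
One way to iterate, sketched here as an untested variation rather than part of the original answer, is to change the input record separator so each read returns everything up to and including the next closing tag. The OP's multi-line substitution then stays bounded by the size of one such chunk rather than the whole file (though a single very large document still makes a large chunk):

open (my $infh,  "<", $file)       or die "Can't open $file for read: $!";
open (my $outfh, ">", "$file.tmp") or die "Can't open $file.tmp for write: $!";

{
    local $/ = "</DOCUMENT>\n";   # each "line" now ends at a closing </DOCUMENT> tag
    while ( my $chunk = <$infh> ) {
        # at most one complete document per chunk, so the regex never scans past it
        $chunk =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//s;
        print $outfh $chunk;
    }
}

close $infh  or die "Cannot close $file: $!";
close $outfh or die "Cannot close $file.tmp: $!";
unlink $file;
rename "$file.tmp", $file;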

tjd

Working with XML using regexes is error-prone and inefficient, as your code, which slurps the whole file into a single string, shows. To deal with XML you should be using an XML parser. In particular, you want a SAX parser, which works on the XML a piece at a time, as opposed to a DOM parser, which must read the whole file.

I'm going to answer your question as is because there's some value in knowing how to work line by line.

If you can avoid it, don't read a whole file into memory. Work line by line. Your task seems to be to remove a handful of lines from an XML file: everything between `<DOCUMENT>\n<TYPE>EX-` and `</DOCUMENT>`. We can do that line by line by keeping a bit of state.

use autodie;

open (my $infh, "<", $file);
open (my $outfh, ">", "$file.tmp");

my $in_document = 0;
my $in_type_ex  = 0;
# note: the <DOCUMENT>, </DOCUMENT> and <TYPE>EX- marker lines themselves are never
# printed; only the content of excluded documents is skipped via the two flags
while( my $line = <$infh> ) {
    if( $line =~ m{<DOCUMENT>\n}i ) {
        $in_document = 1;
        next;
    }
    elsif( $line =~ m{</DOCUMENT>}i ) {
        # reset both flags so state does not leak into the next document
        $in_document = 0;
        $in_type_ex  = 0;
        next;
    }
    elsif( $in_document and $line =~ m{<TYPE>EX-}i ) {
        $in_type_ex = 1;
        next;
    }
    elsif( $in_document and $in_type_ex ) {
        next;
    }
    else {
        print $outfh $line;
    }
}

# close both handles before replacing the file (autodie checks the closes for us)
close $infh;
close $outfh;

rename "$file.tmp", $file;

Using a temp file allows you to read the file while you construct its replacement.

Of course this will fail if the XML document isn't formatted just so (I helpfully added the /i flag to the regexes to allow lower-case tags); you should really use a SAX XML parser.

Schwern

While working on a somewhat large (1.2G) file with Perl 5.10.1 on Windows Server 2013, I have noticed that

foreach my $line (<LOG>) {}

fails with out of memory, while

while (my $line = <LOG>) {}

works in a simple script that just runs some regexes and prints the lines I'm interested in.
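
For illustration only (not part of the original answer), a minimal version of such a script; the log file name and pattern are made up:

use strict;
use warnings;

open my $log, '<', 'server.log' or die "Can't open server.log: $!";  # hypothetical file
while ( my $line = <$log> ) {          # scalar context: reads one line per iteration
    print $line if $line =~ /ERROR/;   # hypothetical pattern of interest
}
close $log or die "Cannot close server.log: $!";

Replacing the while loop with `foreach my $line (<$log>)` forces list context, which pulls every line into memory before the first iteration runs.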

Jørgen Austvik
  • That's because in the first sample `<LOG>` is evaluated in list context, i.e. the whole file is slurped in. In your second sample `<LOG>` is evaluated in scalar context and only one line at a time is fetched. – PerlDuck Oct 24 '16 at 15:47