I have a problem with the following code under the latest release of Strawberry Perl for Windows. I want to read in all the text files in a directory and process their contents. I cannot process them line by line, because some of the changes I want to make span newlines. The processing largely involves removing large chunks from the files (in my example code below it is just one substitution, but ideally I would run a couple of similar regexes that each cut a block out of the file).
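To illustrate why a line-by-line pass is not enough, here is a tiny made-up sample; only the <DOCUMENT> / <TYPE>EX- / </DOCUMENT> markers are real (they are what my regex looks for), the rest is filler:

use strict;
use warnings;

# miniature stand-in for one of my files (made-up contents)
my $sample = "keep this line\n"
           . "<DOCUMENT>\n<TYPE>EX-10.1\nexhibit line 1\nexhibit line 2\n</DOCUMENT>\n"
           . "keep this line too\n";

# the same substitution as in the script below; it only works with the whole
# file in one string, because the match spans several newlines
$sample =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs;
print $sample; # prints only the two "keep" lines (separated by a blank line)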
I am running this script on a large number of files (>10,000), and it always dies with an "Out of memory!" message on one particular file that is larger than 400 MB. The strange thing is that when I write a program that processes ONLY that ONE file, the code works fine (a stripped-down version of that single-file test is shown after the main script below).
The machine has 8 GB of RAM, so I would think that physical RAM is not the issue.
I read through other posts on memory issues, but did not find anything that would help me achieve my goal.
Can anyone suggest what I would need to change to make the program work, i.e., make it more memory efficient or somehow sidestep the issue?
use strict;
use warnings;
use Path::Iterator::Rule;
use utf8;
use open ':std', ':encoding(utf-8)';
my $doc_rule = Path::Iterator::Rule->new;
$doc_rule->name('*.txt'); # only process text files
$doc_rule->max_depth(3); # don't recurse deeper than 3 levels
my $doc_it = $doc_rule->iter("C:/Temp/"); # forward slashes avoid backslash-escape problems in the quoted Windows path
while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    # read in the entire file
    open (my $fh, "<", $file) or die "Can't open $file for read: $!";
    my @lines;
    while (<$fh>) { push (@lines, $_) } # slurp entire file
    close $fh or die "Cannot close $file: $!";
    my $lines = join("", @lines); # put entire file into one string

    $lines =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs; # perform the processing

    # write the modified contents back out
    open (my $out, ">", $file) or die "Can't open $file for write: $!";
    print $out $lines; # dump entire file
    close $out or die "Cannot close $file: $!";
}
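For comparison, the single-file program that runs without problems is essentially the loop body with the path hard-coded, roughly like this (the file name here is a placeholder, not the real 400 MB file):

use strict;
use warnings;
use open ':std', ':encoding(utf-8)';

my $file = "C:/Temp/bigfile.txt"; # placeholder name

open (my $fh, "<", $file) or die "Can't open $file for read: $!";
my @lines;
while (<$fh>) { push (@lines, $_) } # slurp entire file
close $fh or die "Cannot close $file: $!";

my $lines = join("", @lines); # put entire file into one string
$lines =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs; # perform the processing

open (my $out, ">", $file) or die "Can't open $file for write: $!";
print $out $lines; # dump entire file
close $out or die "Cannot close $file: $!";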