I've patched together a Perl script intended to take each word from a batch of documents, eliminate all stop words, stem the remaining words, and build a hash mapping each stemmed word to its frequency of occurrence. However, after the script has been running for several minutes, I get an "Out of Memory!" message in the command window. Is there a more efficient way to achieve the desired result, or do I just need to find a way to access more memory?
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::StopWords qw(%StopWords);
use Lingua::Stem qw(stem);
use Mojo::DOM;
my $path = "U:/Perl/risk disclosures/2006-28";
chdir($path) or die "Can't chdir to $path: $!";
# This program tallies how often each unique stemmed word occurs across a batch of 10-K filings.
my %sequences;
# Open each file and slurp its contents.
for my $file (<*.htm>) {
    my $data = do {
        open my $fh, '<', $file or die "Can't open $file: $!";
        local $/; # Slurp mode: read the whole file at once.
        <$fh>;
    };
    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();
    for ( split /\s+/, $text ) {
        # Skip stop words.
        while ( !$StopWords{$_} ) {
            # Retain only the word stem.
            my $stemmed_word = stem($_);
            ++$sequences{$stemmed_word};
        }
    }
}
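For reference, this is the per-word behaviour I'm aiming for, shown as a minimal sketch on a hypothetical hard-coded word list (the @words list and the %freq hash name are just for illustration; note that Lingua::Stem's stem() returns an array reference, so the first element has to be unpacked):

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::StopWords qw(%StopWords);
use Lingua::Stem qw(stem);

# Hypothetical stand-in for one document's words.
my @words = qw(The company faces significant risks relating to litigation);

my %freq;
for my $word (map { lc } @words) {
    next if $StopWords{$word};         # %StopWords keys are lowercase
    my ($stemmed) = @{ stem($word) };  # stem() returns an array ref of stems
    ++$freq{$stemmed};
}
print "$_: $freq{$_}\n" for sort keys %freq;

On a small list like this, each non-stop word should come out as a single stemmed key with a count of 1; it's the run over the full corpus that dies with the memory error.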