Most memory-efficient way to combine word stemming and the elimination of hash words in Perl?

Question

I've patched together a Perl script intended to take each word from a batch of documents, eliminate all stop words, stem the remaining words, and create a hash containing each stemmed word and its frequency of occurrence. However, after it has been running for several minutes, I get an "Out of Memory!" message in the command window. Is there a more efficient way to achieve the desired result, or do I just need to find a way to access more memory?

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::StopWords qw(%StopWords);
use Lingua::Stem qw(stem);
use Mojo::DOM;

my $path = "U:/Perl/risk disclosures/2006-28";
chdir($path) or die "Cant chdir to $path $!";

# This program counts the total number of unique sentences in a 10-K and enumerates the frequency     of each one.

my @sequence;
my %sequences;
my $fh;

# Opening each file and reading its contents.
for my $file (<*.htm>) {
    my $data = do {
        open my $fh, '<', $file;
        local $/;    # Slurp mode
        <$fh>;
    };
    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();
    for ( split /\s+/, $text ) {
        # Here eliminating stop words.
        while ( !$StopWords{$_} ) {
            # Here retaining only the word stem.
            my $stemmed_word = stem($_);
            ++$sequences{"$stemmed_word"};
        }
    }
}

I think you need to change `while (!$StopWords{$_}) { ... }` to `next if defined $StopWords{$_};`. You're already checking one word at a time with `for (split ...)`, so either that word is a stop-word or it isn't, no need for a second loop. — ThisSuitIsBlackNot, Sep 17 '14 at 19:37
Yes, that did get rid of the "Out of Memory" error message, thank you! — Rick, Sep 17 '14 at 20:12

score 0 · Accepted Answer · answered Sep 17 '14 at 20:52

If a word is not in %StopWords, you enter an infinite loop:

while ( !$StopWords{$_} ) {
    my $stemmed_word = stem($_);
    ++$sequences{"$stemmed_word"};

    # %StopWords hasn't changed, so $_ is still not in it
}

There's actually no reason to use a loop here at all. You're already checking one word at a time with your for loop. A word is either a stop-word or it isn't, so you only need to check it once.

I would do something more like the following:

my $dom  = Mojo::DOM->new($data);
my @words = split ' ', $dom->all_text();

foreach my $word (@words) {
    next if defined $StopWords{$word};

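    # Per the Lingua::Stem docs, stem() takes a list of words and returns
    # an array reference, so take the first (only) element.
    my $stemmed_word = stem($word)->[0];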
    ++$sequences{$stemmed_word};
}

In addition to replacing the inner while loop with

next if defined $StopWords{$word};

I also:

- removed the intermediate $text variable, since it seems like you really only care about individual words, not the full block of text;
- added an explicit loop variable in the for loop. Various functions change $_ implicitly, so to avoid unintended side effects I use explicit loop variables for everything except one-liners like `say for @array;`
- removed the extraneous quotation marks from ++$sequences{"$stemmed_word"};
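
To illustrate the point about $_, here is a minimal sketch using only core Perl. The count_lines helper is hypothetical, made up for this example; it stands in for any function that reads a filehandle with a bare `while <$fh>`, which assigns to the global $_ without localizing it. Because a for loop without an explicit variable aliases $_ to each array element, such a call can silently overwrite the data being iterated over:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): counts lines in a
# string by reading it through an in-memory filehandle.
sub count_lines {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my $n = 0;
    $n++ while <$fh>;    # implicitly assigns each line to the global $_
    close $fh;
    return $n;
}

# Implicit loop variable: $_ is an alias for the current array element,
# so the helper's assignments to $_ overwrite the element itself.
my @implicit = qw(alpha beta);
for (@implicit) {
    count_lines("x\ny\n");
}

# Explicit loop variable: $word is lexical, so the helper cannot touch it.
my @explicit = qw(alpha beta);
for my $word (@explicit) {
    count_lines("x\ny\n");
}

my $survivors_implicit = grep { defined } @implicit;
my $survivors_explicit = grep { defined } @explicit;
print "implicit loop: $survivors_implicit of 2 elements still defined\n";
print "explicit loop: $survivors_explicit of 2 elements still defined\n";
```

In the first loop the array elements end up undefined, because the last read at end-of-file leaves $_ set to undef; in the second they are untouched. Writing `local $_;` inside the helper would also avoid the problem, but you can't always control the code you call, which is why I prefer the explicit loop variable.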

I've incorporated all of your suggestions and that part of my code seems to work well now, thanks! — Rick, Sep 18 '14 at 14:43
