[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]
Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.
I've got a script that produces frequency lists. It essentially does the following:
- Reads in lines from a file having the format
$frequency \t $item
. Any given$item
may occur multiple times, with different values for$frequency
. - Eliminates certain lines depending on the content of
$item
. - Sums the frequencies of all identical
$item
s, regardless of their case, and merges these entries into one. - Performs a reverse natural sort on the resulting array.
- Prints the results to an output file.
The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).
The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.
So I need to somehow optimize the script. And that's what I'm here asking for help with.
Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.
I'd be enormously grateful if your comments and suggestions contained enough code to actually allow me to more or less drop it in to the existing script, as I'm not a programmer by trade, as I said above, and even something so apparently simple as piping the text being processed through some module or another would throw me for a serious curve.
Thanks in advance!
(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.
#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);
# DEFINE VARIABLES
my $delimiter = "\t";
my $split_char = "\t";
my $input_file_name = "";
my $output_file_name = "";
my $in_basename = "";
my $frequency = 0;
my $item = "";
# READ COMMAND LINE OPTIONS
GetOptions (
"input|i=s" => \$input_file_name,
"output|o=s" => \$output_file_name,
);
# INSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
die
"\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}
# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {
# STRIP EXTENSION FROM INPUT FILE NAME
$in_basename = $input_file_name;
$in_basename =~ s/(.+)\.(.+)/$1/;
# GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
$output_file_name = "$in_basename.output.txt";
}
# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
or die "\nERROR: Can't open input file ($input_file_name): $!";
# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";
# PROCESS INPUT FILE LINE BY LINE
my %F;
while (<INPUTFILE>) {
chomp;
# PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
( $frequency, $item ) = split( /$split_char/, $_, 2 );
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;
# PROCESS INPUT LINES
$F{ lc($item) } += $frequency;
}
close INPUTFILE;
# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
|| die "\nERROR: The output file \($output_file_name\) couldn't be opened for writing!\n";
# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}
close OUTPUTFILE;
exit;
Below is some sample input from the source file. It's tab-separated, and the first column is $frequency
, while all the rest together is $item
.
2 útil volver a valdivia
8 útil volver la vista
1 útil válvula de escape
1 útil vía de escape
2 útil vía fax y
1 útil y a cabalidad
43 útil y a el
17 útil y a la
1 útil y a los
21 útil y a quien
1 útil y a raíz
2 útil y a uno