
[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]

Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.

I've got a script that produces frequency lists. It essentially does the following:

  • Reads in lines from a file having the format $frequency \t $item. Any given $item may occur multiple times, with different values for $frequency.
  • Eliminates certain lines depending on the content of $item.
  • Sums the frequencies of all identical $items, regardless of their case, and merges these entries into one.
  • Performs a reverse natural sort on the resulting array.
  • Prints the results to an output file.

The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM killer when combined memory use hits somewhere around 70 GB (of the 92 GB total).

The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.

So I need to somehow optimize the script. And that's what I'm here asking for help with.

Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.

I'd be enormously grateful if your comments and suggestions contained enough code to let me more or less drop it into the existing script. As I said above, I'm not a programmer by trade, and even something as apparently simple as piping the text being processed through some module or other would throw me for a serious curve.

Thanks in advance!

(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.)

#!/usr/bin/env perl

use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);

# DEFINE VARIABLES
my $delimiter            = "\t";
my $split_char           = "\t";

my $input_file_name  = "";
my $output_file_name = "";
my $in_basename      = "";
my $frequency        = 0;
my $item             = "";

# READ COMMAND LINE OPTIONS
GetOptions (
             "input|i=s"         => \$input_file_name,
             "output|o=s"        => \$output_file_name,
           );

# ENSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
    die
      "\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}

# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {

    # STRIP EXTENSION FROM INPUT FILE NAME
    $in_basename = $input_file_name;
    $in_basename =~ s/(.+)\.(.+)/$1/;

    # GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
    $output_file_name = "$in_basename.output.txt";
}

# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
    or die "\nERROR: Can't open input file ($input_file_name): $!";

# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";

# PROCESS INPUT FILE LINE BY LINE
my %F;

while (<INPUTFILE>) {

    chomp;

    # PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
    ( $frequency, $item ) = split( /$split_char/, $_, 2 );

    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;

    # PROCESS INPUT LINES
    $F{ lc($item) } += $frequency;
}
close INPUTFILE;

# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
    || die "\nERROR: The output file \($output_file_name\) couldn't be opened for writing!\n";

# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
    print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}

close OUTPUTFILE;

exit;

Below is some sample input from the source file. It's tab-separated, and the first column is $frequency, while all the rest together is $item.

2   útil    volver  a   valdivia
8   útil    volver  la  vista
1   útil    válvula de  escape
1   útil    vía de  escape
2   útil    vía fax y
1   útil    y   a   cabalidad
43  útil    y   a   el
17  útil    y   a   la
1   útil    y   a   los
21  útil    y   a   quien
1   útil    y   a   raíz
2   útil    y   a   uno
Gil Williams
  • Maybe use a divide and conquer approach? Split your large files into several smaller files, process them, aggregate the results. – xxfelixxx Apr 07 '17 at 00:58
  • That doesn't work because a given value of $item (say, "apple") may appear 1000 times throughout the input file, each time with a different frequency. And I need to merge ALL instances of "apple" (or whatever) and sum ALL their frequencies. Dividing the input file into multiple files will only ever achieve this within each split file, and not globally. – Gil Williams Apr 07 '17 at 01:05
  • The *globally* happens when you merge and aggregate the results. You're not going to find a way to do it without splitting it somehow. – Ken White Apr 07 '17 at 01:10
  • I actually have split it already, in a pre-processing step in order to clean out invalid entries. That reduced the total file size by 3-5%, which still leaves me with massive files that I can't process when I merge them back. The key, I believe, is that somehow I've managed to get Perl to use 70+ GB of memory while processing a 3-6 GB file. Hence, I am **convinced** that I'm doing something wrong at the programming level. I mean, let's be honest -- if there were an IgNobel prize for coding, I'd have just won it. – Gil Williams Apr 07 '17 at 01:14
  • Possibly, however, none of the code you have posted would use up that much memory, given your input file size. Perhaps we need to see more code. The construct of looping over rows of data and counting is usually rather efficient. – xxfelixxx Apr 07 '17 at 01:17
  • http://stackoverflow.com/questions/1359771/perl-memory-usage-profiling-and-leak-detection – xxfelixxx Apr 07 '17 at 01:19
  • Thanks for the link, but I really can't make heads or tails of how I'd implement memory profiling. – Gil Williams Apr 07 '17 at 01:29
  • Looping over rows of data and counting is probably not the issue. The massive array probably is, I'm guessing. – Gil Williams Apr 07 '17 at 01:32
  • Maybe..but there is no massive array in the code you presented. – xxfelixxx Apr 07 '17 at 01:50
  • Sorry, I meant hash rather than array. Look for `%F` in the code. – Gil Williams Apr 07 '17 at 02:03
  • Why do you have the `END { ... }` block inside the while loop? That doesn't make much sense. Are you sure the `%F` hash is the memory usage culprit? Use [Devel::Size](https://metacpan.org/pod/Devel::Size) to verify it. – Ron Bergin Apr 07 '17 at 02:23
  • Instead of building a large (memory footprint) data structure, you might want to consider storing it in a DB. That "should" reduce the memory footprint, but will probably increase the CPU usage. – Ron Bergin Apr 07 '17 at 02:35
  • Good catch on the `END` block. I've moved it and am running the script on a 3.6 GB file right now. But nothing seems to have changed -- it's already consumed 10 GB of RAM in about 20 minutes. – Gil Williams Apr 07 '17 at 03:16
  • Btw, a standard approach when data sizes start biting is to use a database. Like [SQLite](https://www.sqlite.org/) via [DBD::SQLite](http://search.cpan.org/~ishigaki/DBD-SQLite-1.54/lib/DBD/SQLite.pm) in Perl, or mySQL. More efficient at everything, in particular for sorting, and much easier to learn than Perl. – zdim Apr 07 '17 at 04:27
  • I combined my initial comments into an answer, let me know if explanations are needed. I suspect that there may be more going on in code that is not shown, so some debugging ideas: how is memory doing when you start dropping parts of the process? Start with sorting -- write it out unsorted. – zdim Apr 07 '17 at 06:03
  • How many unique `$item`s do you have? – tripleee Apr 07 '17 at 07:14
  • @tripleee, in this file there are about 250 million lines (with 1 `$item` per line). And I've got bigger files to process later! – Gil Williams Apr 07 '17 at 08:31
  • The question is how many times is an `$item` repeated because that will dictate your approach. If they are very sparse (meaning you have close to 250M unique items) keeping it all in memory is probably not feasible. – tripleee Apr 07 '17 at 08:32
  • Depending on what the items look like and how many times you need to run this, you might consider running multiple passes over the input and processing just a subset of the matches. For a simple example, if your range is 0-99, do only 0-24 on the first pass and just ignore items outside this range. Repeat as necessary. – tripleee Apr 07 '17 at 08:36

2 Answers


UPDATE   In my tests a hash takes about 2.5 times the memory of its data alone. However, the program as a whole is consistently 3-4 times as large as its variables. That turns a 6.3 GB data file into a ~15 GB hash, inside a ~60 GB program, just as reported in the comments.

So, roughly speaking, 6.3 GB of data becomes a 60 GB process. The cleanup below still improved the situation enough to get through the current problem, but it is clearly not a solution. See the (updated) Another approach below for a way to run this processing without loading the whole hash.


There is nothing obvious that would lead to an order-of-magnitude memory blowup. However, small errors and inefficiencies can add up, so let's clean up first. See other approaches at the end.

Here is a simple re-write of the core of the program, to try first.

# ... set filenames, variables 
open my $fh_in, '<:encoding(utf8)', $input_file_name
    or die "\nERROR: Can't open input file ($input_file_name): $!";

my %F;    
while (<$fh_in>) {    
    chomp;
    s/^\s*//;                                          # trim leading space
    my ($frequency, $item) = split /$split_char/, $_, 2;

    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' 
         or not defined $item      or $item =~ /^\s*$/;

    # ... increment counters and aggregates and add to hash
    # (... any other processing?)
    $F{ lc($item) } += $frequency;
}
close $fh_in;

# Sort and print to file
# (Or better write: "value key-length key" and sort later. See comments)
open my $fh_out, '>:encoding(utf8)', $output_file_name 
    or die "\nERROR: Can't open output file ($output_file_name\: $!";

foreach my $item ( sort { 
        $F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b 
    } keys %F )
{
    print $fh_out $F{$item}, "\t", $item, "\n";
}
close $fh_out;

A few comments, let me know if more is needed.

  • Always add $! to error-related prints, to see the actual error. See perlvar.

  • Use lexical filehandles (my $fh rather than IN), it's better.

  • If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ... (but it doesn't hurt either).

The sort here has to at least copy its input, and with multiple conditions more memory is needed.

That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.

Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.

Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.

But even this pushes the limits, due to the greatly increased memory footprint compared to the data, since the hash takes 2.5 times the data size and the whole program is again 3-4 times as large.

Then find an algorithm to write the data line by line to the output file. That is simple to do here since, for the shown processing, we only need to accumulate frequencies for each item.

use feature 'say';    # say() is used below

# ... open the input file into $fh_in, as before ...

open my $fh_out, '>:encoding(utf8)', $output_file_name
    or die "\nERROR: Can't open output file ($output_file_name): $!";

my $cumulative_freq;

while (<$fh_in>) {
    chomp;
    s/^\s*//;                                          # trim leading space only
    my ($frequency, $item) = split /$split_char/, $_, 2;

    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' 
         or not defined $item      or $item =~ /^\s*$/;

    $cumulative_freq += $frequency;  # would-be hash value

    # Add a sort criterion, $item's length, helpful for later sorting
    say $fh_out $cumulative_freq, "\t", length $item, "\t", lc($item);

    #say $fh_out $cumulative_freq, "\t", lc($item);
}
close $fh_out;

Now we can use the system's sort, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key, run in a terminal

sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3-  > result

The command sorts numerically by the first field and then by the second (and finally by the rest of the line as a last resort), in reverse order. The result is piped into cut, which pulls out the first and third fields from STDIN (tab being the default delimiter), which is the needed output.

A systemic solution is to use a database, and a very convenient one is DBD::SQLite.
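For illustration, here is a rough sketch of what that route could look like. It is not a drop-in replacement for the script above; the freq.db scratch file, the items table, and the hard-coded file names are made up for this example. The idea is that the rows and the per-item totals live on disk in the database rather than in a Perl hash, and the GROUP BY / ORDER BY work is done by SQLite.

#!/usr/bin/env perl
# Sketch only: "freq.db", the "items" table and the file names are invented here.
use strict;
use warnings;
use DBI;

my $input_file_name  = 'input.txt';
my $output_file_name = 'output.txt';

my $dbh = DBI->connect( 'dbi:SQLite:dbname=freq.db', '', '',
    { RaiseError => 1, AutoCommit => 0 } );    # one big transaction for speed

$dbh->do('CREATE TABLE IF NOT EXISTS items (item TEXT, freq INTEGER)');
my $ins = $dbh->prepare('INSERT INTO items (item, freq) VALUES (?, ?)');

open my $fh_in, '<:encoding(utf8)', $input_file_name
    or die "Can't open $input_file_name: $!";
while (<$fh_in>) {
    chomp;
    my ($frequency, $item) = split /\t/, $_, 2;
    next if not defined $frequency or $frequency eq ''
         or not defined $item      or $item =~ /^\s*$/;
    $ins->execute( lc($item), $frequency );    # rows go to disk, not to a hash
}
close $fh_in;
$dbh->commit;

# Let SQLite aggregate and sort, then write the result line by line
my $sth = $dbh->prepare(
    'SELECT SUM(freq) AS f, item FROM items GROUP BY item ORDER BY f DESC');
$sth->execute;

open my $fh_out, '>:encoding(utf8)', $output_file_name
    or die "Can't open $output_file_name: $!";
while ( my ($f, $item) = $sth->fetchrow_array ) {
    print $fh_out "$f\t$item\n";
}
close $fh_out;
$dbh->disconnect;

Wrapping the inserts in a single transaction (AutoCommit => 0 plus the commit) matters a great deal for speed with this many rows.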


I used Devel::Size to see memory used by variables.


A minimal "scalar" structure, built internally to store a value, takes (on my box and perl) 28 bytes. Then each key and value need a scalar... (Each array and hash structure -- even when anonymous -- take a couple hundred bytes as well, but there's likely way fewer of those in a complex data structure.)

zdim
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/141294/discussion-on-answer-by-zdim-need-serious-help-optimizing-script-for-memory-use). – Bhargav Rao Apr 09 '17 at 18:06
  • @GilWilliams That long exchange got moved to chat, please review my last comment. One important part: did you see my addition to the answer, under (edited) **Another approach** (or is there a reason that it won't work for you)? It reads _and prints to file_ line by line and should do away with memory problems. – zdim Apr 09 '17 at 23:06

Sorting input requires keeping all input in memory, so you can't do everything in a single process.

However, sorting can be factored: you can easily sort your input into sortable buckets, then process the buckets, and produce the correct output by combining the outputs in reversed-sorted bucket order. The frequency counting can be done per bucket as well.

So just keep the program you have, but add something around it:

  1. partition your input into buckets, e.g. by the first character or the first two characters
  2. run your program on each bucket
  3. concatenate the output in the right order

Your maximum memory consumption will be slightly more than what your original program takes on the biggest bucket. So if your partitioning is well chosen, you can arbitrarily drive it down.

You can store the input buckets and per-bucket outputs to disk, but you can even connect the steps directly with pipes (creating a subprocess for each bucket processor) - this will create a lot of concurrent processes, so the OS will be paging like crazy, but if you're careful, it won't need to write to disk.

A drawback of this way of partitioning is that your buckets may end up being very uneven in size. An alternative is to use a partitioning scheme that is guaranteed to distribute the input equally (e.g. by putting every nth line of input into the nth bucket) but that makes combining the outputs more complex.
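To make step 1 concrete, here is a rough sketch of one way the partitioning could be done, bucketing by the first character of $item (the bucket_XXXX.txt naming scheme is invented for this example):

#!/usr/bin/env perl
# Sketch of the partitioning step only; bucket file names are invented here.
use strict;
use warnings;

my $input_file_name = 'input.txt';

open my $fh_in, '<:encoding(utf8)', $input_file_name
    or die "Can't open $input_file_name: $!";

my %bucket_fh;    # one output handle per bucket

while (<$fh_in>) {
    my ($frequency, $item) = split /\t/, $_, 2;
    next if not defined $item or $item =~ /^\s*$/;

    # Bucket by the lowercased first character of $item; use its code point
    # in the file name so names stay ASCII even for accented characters.
    my $first  = lc substr( $item, 0, 1 );
    my $bucket = sprintf 'bucket_%04x.txt', ord $first;

    if ( not $bucket_fh{$bucket} ) {
        open $bucket_fh{$bucket}, '>:encoding(utf8)', $bucket
            or die "Can't open $bucket: $!";
    }
    print { $bucket_fh{$bucket} } $_;    # pass the original line through as-is
}
close $fh_in;
close $_ for values %bucket_fh;

Each bucket_*.txt file can then be run through the original script unchanged, and the per-bucket outputs combined as described above.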

reinierpost
  • Sorting is actually a breeze with the GNU `sort` utility! While it takes about 3 hours (on a Core i7) and 60+ GB of RAM and swap to run the Perl code that reads the source file and sums the frequencies of identical items on a 3.6 GB file, sorting the similarly sized output file with `sort` takes only about 10 GB of RAM and less than 20 minutes, using the `--parallel=7` flag in order to use 7 of my 8 cores. – Gil Williams Apr 08 '17 at 18:22
  • It probably uses the same principle. I know it uses temporary files on disk. – reinierpost Apr 08 '17 at 23:05