
I use the following perl code to mass download 10-Ks from the SEC website. However, I get an "Out of memory!" message every few hundred files when the script apparently gets stuck processing an especially large 10-K filing. Any ideas how I can avoid this "Out of memory!" error for large files?

#!/usr/bin/perl
use strict;
use warnings;
use LWP;

my $ua = LWP::UserAgent->new;

open LOG , ">download_log.txt" or die $!;
######## make sure the file with the ids/urls is in the 
######## same folder as the perl script
open DLIST, "downloadlist.txt" or die $!;
my @file = <DLIST>;

foreach my $line (@file) {
    #next if 0.999 > rand;
    #print "Now processing file: $line\n";
    my ($nr, $get_file) = split /,/, $line;
    chomp $get_file;
    $get_file = "http://www.sec.gov/Archives/" . $get_file;
    if ($get_file =~ m/([0-9|-]+).txt/ ) {
        my $filename = $nr . ".txt";
        open OUT, ">$filename" or die $!;
        print "file $nr \n";
        my $response = $ua->get($get_file);
        if ($response->is_success) {
            print OUT $response->content;
            close OUT;
        } else {
            print LOG "Error in $filename - $nr \n" ;
        }
    }
}
Rick
  • Thank you, @Sinan Ünür. I am unfamiliar with how response_data handlers keep my files from being stored in memory (and with response_data handlers more generally). Could you provide a little insight into how you would most efficiently incorporate that into the code above? – Rick Apr 21 '17 at 20:48
  • Actually, let `$ua` [handle the saving](https://metacpan.org/pod/LWP::UserAgent#get). – Sinan Ünür Apr 21 '17 at 20:58
  • Is that different from the "my $response = $ua->get($get_file);" line currently in the code? What exactly are you suggesting? – Rick Apr 21 '17 at 21:07
  • I am suggesting you do not read the entire file content into a variable. Do read what I linked to: ***If a $filename is provided with the :content_file option, then the response content will be saved here instead of in the response object.*** – Sinan Ünür Apr 21 '17 at 21:11
  • Not related to the out of memory, and irrelevant if you follow Sinan's good advice, but you almost always want to use ->decoded_content, not ->content – ysth Apr 21 '17 at 21:17
  • I've added the :content_file option, and so far I've not had any problems with running out of memory. I won't know for sure until it's done processing in a couple of days, if problems arise I will repost. Thanks! – Rick Apr 21 '17 at 21:59
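
A minimal sketch of the change described in the last comment above, reusing the question's $ua, $nr, $get_file, and LOG variables; :content_file is the documented LWP::UserAgent option quoted by Sinan Ünür, which saves the body to the named file instead of keeping it in the response object:

# Hypothetical replacement for the get/print OUT block in the question:
# stream the response body straight to "$nr.txt" instead of buffering it.
my $filename = $nr . ".txt";
my $response = $ua->get($get_file, ':content_file' => $filename);
print LOG "Error in $filename - $nr \n" unless $response->is_success;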

2 Answers


I recently ran into a similar problem using threads and thousands of LWP requests. Never figured out what the memory leak was, but switching to HTTP::Tiny resolved it.

Going from LWP to HTTP::Tiny is simple:

use HTTP::Tiny;

my $ua = HTTP::Tiny->new;

my $response = $ua->get($get_file);
if ($response->{success}) {
    print OUT $response->{content};

... of course HTTP::Tiny could just do the saving part for you, like LWP.

You could also try creating a new LWP object inside the loop, hoping for garbage collection to kick in, but that didn't work for me either. There is something inside the LWP monster that leaks.

Edit: there could also be a problem with trying to download a 2 GB file into a string; the mirror method should solve that for you.
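
A minimal sketch of that approach, assuming the same $nr, $get_file, and LOG variables as in the question; HTTP::Tiny's mirror method writes the body straight to the named file rather than returning it in $response->{content}:

use HTTP::Tiny;

my $ua = HTTP::Tiny->new;

# Hypothetical loop body: save the filing directly to "$nr.txt".
# mirror() sends If-Modified-Since when the file already exists, so
# re-running the script skips filings that are already complete on disk.
my $filename = $nr . ".txt";
my $response = $ua->mirror($get_file, $filename);

unless ($response->{success}) {
    print LOG "Error in $filename - $nr : $response->{status} $response->{reason}\n";
}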

xrmb
  • Welcome to Stack Overflow and the Perl tag. Very good first answer. :-) – simbabque Apr 22 '17 at 00:00
  • *"There is something inside the LWP monster that leaks"* The problem is not in the LWP module suite, which has been extensively used and tested by a huge number of people since it was first released in May 1996. If you have written a program that uses LWP and leaks memory then it is not a problem with the module. – Borodin Apr 22 '17 at 00:34

Just get LWP to store the response data directly into a file instead of in the HTTP::Response object. It's also simpler to code that way.

Here's an example program. I can't test it at present, but it does compile.

I've recently noticed a lot of people writing code that reads entire files into memory before processing the data, and I don't understand why it's so popular. It's wasteful of memory, and it's often more difficult to code a solution that way. I've changed your program to read the download list a line at a time and use each line directly, instead of storing the whole list in an array.

use strict;
use warnings 'all';

use LWP;

my $ua = LWP::UserAgent->new;

open my $dl_fh,  '<', 'downloadlist.txt' or die "Can't open download list file: $!";

open my $log_fh, '>', 'download_log.txt' or die "Can't open log file: $!";

STDOUT->autoflush;

while ( <$dl_fh> ) {

    # next if 0.999 > rand;
    # print "Now fetching file: $_";

    chomp;
    my ($num, $dl_file) = split /,/;

    unless ( $dl_file =~ /[0-9|-]+\.txt/ ) {
        print $log_fh qq{Skipping invalid file "$dl_file"\n};
        next;
    }

    my $url      = "http://www.sec.gov/Archives/$dl_file";
    my $filename = "$num.txt";

    print qq{Fetching file $filename\n};

    my $resp = $ua->get($url, ':content_file' => $filename);

    printf $log_fh qq{Download of "%s" %s\n},
            $filename,
            $resp->is_success ?
            'successful' :
            'FAILED: ' . $resp->status_line;
}
Borodin
  • I have verified that this code does the job as well. The processing speed appears to be relatively similar between both sets of code. – Rick Apr 24 '17 at 14:37
  • @Rick: Thanks. I appear to have been hit by the drive-by downvoter! The other answer has the disadvantage that it still reads the whole file into memory before writing it to disk, which unnecessarily wastes memory. I'm surprised that `HTTP::Tiny` performs any differently from `LWP` to be honest, as the core problem of memory space hasn't been fixed. This code writes data straight to disk as the data arrives from the internet. – Borodin Apr 24 '17 at 14:43