I have a huge webpage, about 5 GB in size, and I would like to read its content directly (remotely) without downloading the whole file. I have used the `open` file handler to open the HTTP content, but the error message given is `No such file or directory`. I tried to use `LWP::Simple`, but it ran out of memory when I used `get` to fetch the whole content. I wonder if there is a way to `open` this content remotely and read it line by line.
Thank you for your help.

- Is it a static webpage or dynamically generated? – mvp Jan 31 '13 at 06:13
- It is static; it is a log file of about 5 GB. `LWP::Simple` simply generates an "Out of memory" error. – Chris Andrews Jan 31 '13 at 06:23
2 Answers
You could try using `LWP::UserAgent`. The `request` method allows you to specify a CODE reference, which would let you process the data as it's coming in.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Request ();

my $request = HTTP::Request->new(GET => 'http://www.example.com/');
my $ua = LWP::UserAgent->new();

# The second argument is a content callback: LWP invokes it for each
# chunk of the response body as it arrives, so the whole body never
# has to fit in memory.
$ua->request($request, sub {
    my ($chunk, $res) = @_;
    print $chunk;
    return undef;
});
Technically the function should return the content instead of undef, but it seems to work if you return undef. According to the documentation:
The "content" function should return the content when called. The content function will be invoked repeatedly until it returns an empty string to signal that there is no more content.
I haven't tried this on a large file, and you would need to write your own code to handle the data coming in as arbitrarily sized chunks.
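If you need line-by-line processing, as the question asks, one way is to buffer the incoming chunks and hand off complete lines as they become available. Here is a minimal sketch along those lines; `process_line` is a hypothetical placeholder for your own per-line logic:
use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Request ();

my $request = HTTP::Request->new(GET => 'http://www.example.com/');
my $ua = LWP::UserAgent->new();

my $buffer = '';
$ua->request($request, sub {
    my ($chunk, $res) = @_;
    $buffer .= $chunk;
    # Hand off every complete line; keep any trailing partial line buffered.
    while ($buffer =~ s/^(.*\n)//) {
        process_line($1);
    }
    return undef;
});
# Flush a final line that lacks a trailing newline.
process_line($buffer) if length $buffer;

sub process_line {
    my ($line) = @_;
    print $line;    # replace with your own per-line logic
}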

- Nice job, thank you! It works on my logs via HTTP. But I still wonder what the `$res` in your code stands for. Thanks. – Chris Andrews Jan 31 '13 at 06:30
- Whoops, I left that in there accidentally. It is a reference to the `HTTP::Response` object, which might be handy. I'll leave it in my answer for now. – chipschipschips Jan 31 '13 at 06:42
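For instance, `$res` can be consulted inside the callback, e.g. to track progress against the size the server advertised. A minimal sketch, assuming the server sends a `Content-Length` header:
my $received = 0;
$ua->request($request, sub {
    my ($chunk, $res) = @_;
    my $total = $res->header('Content-Length');    # may be undef
    $received += length $chunk;
    printf STDERR "\rreceived %d of %s bytes", $received,
        defined $total ? $total : '?';
    print $chunk;
});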
This Perl code will download a file from a URL, resuming the download if the file is already partially present. It requires that the server return the file size (the `Content-Length` header) in response to a `HEAD` request, and that the server support byte ranges on the URL in question.
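Both requirements can be checked up front with a single `HEAD` request; a small sketch (the URL is a placeholder):
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->head('http://example.com/huge-file.bin');
die "HEAD request failed: " . $res->status_line . "\n" unless $res->is_success;

my $size   = $res->header('Content-Length');
my $ranges = $res->header('Accept-Ranges');
die "No Content-Length in HEAD response\n" unless defined $size;
# Some servers honor Range without advertising Accept-Ranges,
# so treat a missing header as a hint rather than proof.
warn "Server does not advertise byte-range support\n"
    unless defined $ranges && $ranges eq 'bytes';
print "Size: $size bytes\n";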
If you want some special processing for each chunk, just override it below:
use strict;
use warnings;
use IO::Handle;                  # for $fh->flush
use LWP::UserAgent;
use List::Util qw(min);

my $url  = "http://example.com/huge-file.bin";
my $file = "huge-file.bin";
DownloadUrl($url, $file);

sub DownloadUrl {
    my ($url, $file, $chunksize) = @_;
    $chunksize ||= 1024*1024;
    my $ua = LWP::UserAgent->new;

    # Ask the server for the total size up front.
    my $res  = $ua->head($url);
    my $size = $res->header('Content-Length');
    die "Cannot get size for $url" unless defined $size;

    # Append mode, so an interrupted download resumes where it left off.
    open my $fh, '>>', $file or die "ERROR: $!";
    binmode $fh;
    for (;;) {
        $fh->flush;
        my $range1 = -s $fh;                          # bytes we already have
        my $range2 = min($range1 + $chunksize, $size);
        last if $range1 == $range2;                   # download complete
        # HTTP byte ranges are inclusive, hence the "- 1".
        $res = $ua->get($url, Range => "bytes=$range1-" . ($range2 - 1));
        # A ranged request should answer "206 Partial Content".
        last unless $res->is_success();
        # process next chunk:
        print $fh $res->content();
    }
    close $fh;
}
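As an example of such an override, a hypothetical `process_chunk` hook could replace (or precede) the `print $fh ...` line at the `# process next chunk` marker. The sketch below scans a streamed log for "ERROR" lines, carrying partial lines across chunk boundaries:
# Hypothetical per-chunk hook; call it from the loop at the
# "# process next chunk" marker, e.g.:  process_chunk($res->content());
my $pending = '';    # carries a partial line across chunk boundaries
sub process_chunk {
    my ($chunk) = @_;
    $pending .= $chunk;
    # Emit complete lines only; a chunk may end mid-line.
    while ($pending =~ s/^(.*\n)//) {
        my $line = $1;
        print STDERR $line if $line =~ /ERROR/;
    }
}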
