For an entire week I have been attempting to write a code that will download links from a webpage and then loop through each link to dump the content written on each link's page. The original webpage I downloaded has 500 links to separate web pages that each contain important information for me. I only want to go one level down. However I am having several issues.
RECAP: I want to download the links from a webpage and automatically have my program print off the text contained in those links. I would prefer to have them printed in a file.
1) When I download the links from the original website, the useful ones are not written out fully. (ie they say "/festevents.nsf/all?openform" which is not a usable webpage)
2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.
#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;
#Download original webpage and acquire 500+ Links#
$url = "http://wx.toronto.ca/festevents.nsf/all?openform";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my $title = $mechanize->title;
print "<b>$title</b><br />";
my @links = $mechanize->links;
foreach my $link (@links) {
# Retrieve the link URL
my $href = $link->url_abs;
#
# $URL1= get("$link");
#
my $ua = LWP::UserAgent->new;
my $response = $ua->get($href);
unless($response->is_success) {
die $response->status_line;
}
my $URL1 = $response->decoded_content;
die Dumper($URL1);
#This part of the code is just to "clean up" the text
$Format=HTML::FormatText->new;
$TreeBuilder=HTML::TreeBuilder->new;
$TreeBuilder->parse($URL1);
$Parsed=$Format->format($TreeBuilder);
open(FILE, ">TorontoParties.txt");
print FILE "$Parsed";
close (FILE);
}
Please help me! I am desperate! If possible please explain to me the logic behind each step? I have been frying my brain on this for a week and I want help seeing other peoples logic behind the problems.