How can I check whether a page contains a specific word? For example, I want to return true or false depending on whether the page contains the word "candybar". Note that "candybar" is sometimes in between tags (candybar) and sometimes not. How do I accomplish this?

Here is my code for "grabbing" the site (I just don't know how to search through the page):

#!/usr/bin/perl -w

use utf8;

use RPC::XML;
use RPC::XML::Client;
use Data::Dumper;
use Encode;
use Time::HiRes qw(usleep);

print "Content-type:text/html\n\n";

use LWP::Simple; 

$pageURL = "http://example.com"; 

$simplePage=get($pageURL);

if ($simplePage =~ m/candybar/) {   
 print "its there!";
}
Fredrik
  • What happens when you run this? – Ilion May 16 '12 at 22:46
  • It would be a good idea to first check that your request was successful and that you got the content you were expecting. – ArtMat May 16 '12 at 23:11
  • This seems fine to me, apart from you missing `use strict` and `use warnings` from the head of your program. (It is polite to include both of these before asking for help.) I also suggest a line `defined $simplePage or die "Can't get URL";` after the `get` call. What goes wrong with this program as it is? – Borodin May 17 '12 at 00:31
  • One thing I suggest, apart from what everyone else said, is to use \b in your regex to indicate a word boundary: `m/\bcandybar\b/`. That is, if you don't want iwantcandybar to match. – Hameed May 17 '12 at 01:20

1 Answer

I'd suggest that you use some kind of parser if you're looking for words in HTML or anything else that's tagged in a known way (XML, for example). I use HTML::TokeParser, but there are many parsing modules on CPAN.

I've left the explanation of the returns from the parser as comments, in case you use this parser. This is extracted from a live program that I use to machine translate the text in web pages, so I've taken out some bits and pieces.

The comment above about checking the status and content of what LWP returns is very sensible too: if the website is off-line, you need to know that.
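
As a rough illustration of that status check, here is a minimal sketch using LWP::UserAgent instead of LWP::Simple, so that the HTTP status is available. The helper name `fetch_page` and the 10-second timeout are assumptions made for this example, not part of the original program.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: returns the decoded page body, or dies with the
# status line reported by LWP if the request did not succeed.
sub fetch_page {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new( timeout => 10 );    # timeout is an arbitrary choice
    my $response = $ua->get($url);
    die "Can't get $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}
```

It could be called as `my $simplePage = fetch_page($pageURL);`, after which the regex or parser check runs only on a page that was actually retrieved.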

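use strict;
use warnings;
use HTML::TokeParser;

my $file = 'page.html';    # assumption: a saved copy of the page to search
open( my $fh, "<:utf8", $file ) || die "Can't open $file : $!";

my $p = HTML::TokeParser->new($fh) || die "Can't open: $!";

$p->empty_element_tags(1);    # configure its behaviour
my $found = 0;                # set to 1 once the word is seen
while ( my $token = $p->get_token ) {
    #["S",  $tag, $attr, $attrseq, $text]
    #["E",  $tag, $text]
    #["T",  $text, $is_data]
    #["C",  $text]
    #["D",  $text]
    #["PI", $token0, $text]
    # The live program used its own get_output() helper here (not shown);
    # reading the first two fields directly is enough for this check.
    my ( $type, $string ) = @{$token}[ 0, 1 ];
    # ["T",  $text, $is_data] : rule for text
    if ( $type eq 'T' && $string =~ /\bcandybar\b/ ) {
        $found = 1;
        last;
    }
}
print $found ? "true\n" : "false\n";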
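
Since the question fetches the page into a string rather than a file, the same idea can be sketched against a string. This is only an illustration: `page_contains_word` is a hypothetical helper name, and it relies on the documented behaviour that HTML::TokeParser->new accepts a reference to a scalar holding the document.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns 1 if $word occurs as a whole word in a text
# token of $html, 0 otherwise. Tag names and attribute values are not searched.
sub page_contains_word {
    my ( $html, $word ) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Can't create parser";
    while ( my $token = $p->get_token ) {
        next unless $token->[0] eq 'T';    # only look at text tokens
        # \Q...\E quotes any regex metacharacters in $word
        return 1 if $token->[1] =~ /\b\Q$word\E\b/;
    }
    return 0;
}

print page_contains_word( '<p>I want a <b>candybar</b>!</p>', 'candybar' )
    ? "true\n"
    : "false\n";
```

Because only text tokens are inspected, a page that mentions the word solely in an attribute such as class="candybar" would not count as a match, which may or may not be what you want.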
Hugh Barnard