
This question is different from but related to How to Parse this HTML with Web::Scraper?.

I have to scrape a page using Web::Scraper where the HTML can change slightly. Sometimes it can be

<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
  </p>
  <p>
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
  </p>
  <p>
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>

I am extracting this with the following Web::Scraper code:

my $test = scraper {
    process 'div p', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process '//p/text()', 'desc' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};
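For reference, a hypothetical end-to-end run of the scraper above against the first HTML variant might look like this (Web::Scraper and Data::Dumper assumed available; the sample markup is inlined so the script is self-contained):

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# Sample of the first HTML variant, inlined for a standalone run.
my $html = <<'HTML';
<div>
  <p><strong>TITLE1</strong><br>DESCRIPTION1</p>
  <p><strong>TITLE2</strong><br>DESCRIPTION2</p>
</div>
HTML

# The scraper from the question: one inner scraper per <p>.
my $test = scraper {
    process 'div p', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process '//p/text()', 'desc' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};

print Dumper $test->scrape(\$html)->{test};
```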

But sometimes the page contains the following HTML instead (notice that the title/description pairs are no longer in separate <p> elements).

<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>

How can I scrape the HTML above into

test => [
  { desc => "DESCRIPTION1 ", name => "TITLE1" },
  { desc => "DESCRIPTION2 ", name => "TITLE2" },
  { desc => "DESCRIPTION3 ", name => "TITLE3" },
]

I have tried modifying the code above but I cannot work out what HTML to use to 'split' the unique title and description pairs.

    I am always suspicious of questions that ask about web scraping. Either it is your own data that, because of some calamity, your only remaining copy is formatted in HTML, or it is someone else's data that they have not made available to you either directly or through an API. Either of those casts clouds over your intentions, and I really don't want to encourage you along this road unless you explain why you are doing it – Borodin Sep 16 '15 at 01:44
  • Hi, I understand your concerns. This is an old internal Intranet site of ours that I am doing some migration work on. We've had an issue with being locked out from the database and webserver (Angry ex employees after some 're-structuring' ) so I thought I would try to scrape the data out. There is nothing unethical or illegal happening, I am just trying to assist our company as they are struggling to get access to the DB. I don't really have a way to prove this obviously. – user1768233 Sep 16 '15 at 01:55
  • Yeah, there is no way to prove it, but the company can legally sue the "Angry ex employees" and force them to give back access info plus pay for the damage. If the database is very small and not important, then scraping the data out is OK, but that's rarely the case. In most cases, the database has several tables with relationships and that's not easy to scrape out. – Racil Hilan Sep 16 '15 at 09:41

2 Answers


I've never used Web::Scraper before, but its behavior here seems broken, or at least odd.

The following XPath expressions should work for both cases, with only minor adjustment:

//div//strong/text()
//div//br/following-sibling::text()
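These expressions can also be checked from Perl directly. A minimal standalone sketch using XML::LibXML (the same libxml2 engine that xmllint uses below; the module and the sample markup are assumptions, not part of the original answer):

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample of the first HTML variant, inlined for a standalone check.
my $html = <<'HTML';
<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
  </p>
</div>
HTML

my $doc = XML::LibXML->load_html( string => $html, recover => 1 );

# Titles: the text inside each <strong>.
for my $node ( $doc->findnodes('//div//strong/text()') ) {
    print "name: ", $node->data, "\n";
}

# Descriptions: the text node(s) following each <br>.
for my $node ( $doc->findnodes('//div//br/following-sibling::text()') ) {
    ( my $text = $node->data ) =~ s/^\s+|\s+$//g;
    print "desc: $text\n" if length $text;
}
```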

When plugging these into xmllint (libxml2):

tmp >xmllint --html --shell a.html
/ > cat /
 -------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
  <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
  </p>
  <p>
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
  </p>
  <p>
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
  </p>
</div>
</body></html>

/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1  TEXT
    content=TITLE1
2  TEXT
    content=TITLE2
3  TEXT
    content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 3 nodes:
1  TEXT
    content=     DESCRIPTION1
2  TEXT
    content=     DESCRIPTION2
3  TEXT
    content=     DESCRIPTION3

/ > load b.html
/ > cat /
 -------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <p>
    <strong>TITLE1</strong>
    <br>
    DESCRIPTION1
    <strong>TITLE2</strong>
    <br>
    DESCRIPTION2
    <strong>TITLE3</strong>
    <br>
    DESCRIPTION3
    </p>
</div></body></html>

/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1  TEXT
    content=TITLE1
2  TEXT
    content=TITLE2
3  TEXT
    content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 5 nodes:
1  TEXT
    content=  DESCRIPTION1
2  TEXT
    content=
3  TEXT
    content=  DESCRIPTION2
4  TEXT
    content=
5  TEXT
    content=  DESCRIPTION3

When you plug various versions of these into Web::Scraper, though, they don't work.

 process '//div', 'test[]' => scraper {
    process '//strong', 'name' => 'TEXT';
    process '//br/following-sibling::text()', 'desc' => 'TEXT';
  };

Results in:

/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }

process '//div', 'test[]' => scraper {
  process '//div//strong', 'name' => 'TEXT';
  process '//div//br/following-sibling::text()', 'desc' => 'TEXT';
};

Results in:

/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }

Even the most basic case:

  process 'div', 'test[]' => scraper {
    process 'strong', 'name' => 'TEXT';
  };

Results in:

/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ name => "TITLE1" }] }
{ test => [{ name => "TITLE1" }] }

Even telling it to use libxml2 via use Web::Scraper::LibXML changes nothing.

To make sure I wasn't going insane, I tried the same selection with Ruby's Nokogiri:

 /tmp >for f in a b; do ruby -rnokogiri -rpp -e'pp Nokogiri::HTML(File.read(ARGV[0])).css("div p strong").map &:text' $f.html; done
["TITLE1", "TITLE2", "TITLE3"]
["TITLE1", "TITLE2", "TITLE3"]

What am I missing?

  • I believe we are only getting one title/description pair because the first 'process' line matches on the div, which means the inner scraper processes just that one div. If you use //div/p in the first process, then change the inner scraper's processes to match //p/strong and //p/br etc., it works for the first example but not the second. I have something that is kind of working now; I just need to post-process the arrays to get them into the format I want. I think I have it working and will put my answer in below. – user1768233 Sep 16 '15 at 04:00

I think I worked it out. I'm not sure if it's the best way, but it seems to handle both cases.

my $test = scraper {
    process '//div', 'test' => scraper {
        process '//div//strong//text()', 'name[]' => 'TEXT';
        process '//p/text()', 'desc[]' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};



my $res = $test->scrape(\$html);

# get the names and descriptions
my @keys   = @{ $res->{test}{name} };
my @values = @{ $res->{test}{desc} };

# merge the two arrays into a hash
my %hash;
@hash{@keys} = @values;
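Note that %hash keys descriptions by title, which loses ordering and the array-of-hashes shape asked for in the question. A small follow-up sketch that zips the two parallel arrays into that shape instead (sample data stands in for @{$res->{test}{name}} and @{$res->{test}{desc}}, so this runs without Web::Scraper):

```perl
use strict;
use warnings;

# Sample data standing in for the scraped name/desc arrays.
my @keys   = ('TITLE1', 'TITLE2', 'TITLE3');
my @values = ('DESCRIPTION1', 'DESCRIPTION2', 'DESCRIPTION3');

# Zip the parallel arrays into the array-of-hashes shape from the question,
# preserving order.
my @test = map { { name => $keys[$_], desc => $values[$_] } } 0 .. $#keys;

print "$_->{name}: $_->{desc}\n" for @test;
```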