I've never used Web::Scraper, but its behavior seems either broken or just odd.
The following XPath expressions should more or less work for both cases (the second one needs a small adjustment to skip whitespace-only text nodes when everything sits in a single paragraph):
//div//strong/text()
//div//br/following-sibling::text()
Plugging these into xmllint (libxml2) gives:
/tmp >xmllint --html --shell a.html
/ > cat /
-------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
</p>
<p>
<strong>TITLE2</strong>
<br>
DESCRIPTION2
</p>
<p>
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
</body></html>
/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1 TEXT
content=TITLE1
2 TEXT
content=TITLE2
3 TEXT
content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 3 nodes:
1 TEXT
content= DESCRIPTION1
2 TEXT
content= DESCRIPTION2
3 TEXT
content= DESCRIPTION3
/ > load b.html
/ > cat /
-------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<p>
<strong>TITLE1</strong>
<br>
DESCRIPTION1
<strong>TITLE2</strong>
<br>
DESCRIPTION2
<strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div></body></html>
/ > xpath //div//strong/text()
Object is a Node Set :
Set contains 3 nodes:
1 TEXT
content=TITLE1
2 TEXT
content=TITLE2
3 TEXT
content=TITLE3
/ > xpath //div//br/following-sibling::text()
Object is a Node Set :
Set contains 5 nodes:
1 TEXT
content= DESCRIPTION1
2 TEXT
content=
3 TEXT
content= DESCRIPTION2
4 TEXT
content=
5 TEXT
content= DESCRIPTION3
When you plug various versions of these into Web::Scraper, though, they don't work: only the first title and description come back.
process '//div', 'test[]' => scraper {
    process '//strong', 'name' => 'TEXT';
    process '//br/following-sibling::text()', 'desc' => 'TEXT';
};
Results in:
/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
process '//div', 'test[]' => scraper {
    process '//div//strong', 'name' => 'TEXT';
    process '//div//br/following-sibling::text()', 'desc' => 'TEXT';
};
Results in:
/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
{ test => [{ desc => " DESCRIPTION1 ", name => "TITLE1" }] }
Even the most basic case:
process 'div', 'test[]' => scraper {
    process 'strong', 'name' => 'TEXT';
};
Results in:
/tmp >for f in a b; do perl bs.pl file:///tmp/$f.html; done
{ test => [{ name => "TITLE1" }] }
{ test => [{ name => "TITLE1" }] }
Even telling it to use libxml2 (via use Web::Scraper::LibXML) changes nothing!
To make sure I wasn't going insane, I tried the same thing with Ruby's Nokogiri:
/tmp >for f in a b; do ruby -rnokogiri -rpp -e'pp Nokogiri::HTML(File.read(ARGV[0])).css("div p strong").map &:text' $f.html; done
["TITLE1", "TITLE2", "TITLE3"]
["TITLE1", "TITLE2", "TITLE3"]
What am I missing?