I'm working on a project where I need to harvest some data from a website, so I'm using Web-Harvest.
I'm running into a problem where the data I'm harvesting (comments from news websites) sometimes spans more than one page. I'm trying to configure the script to look for the link to the second page of comments with an XPath expression on the page. The problem is that if I try an if test, the condition always passes, and if I try a try statement, the try body always succeeds. As a result, when an article has only one page of comments, my script extracts the comments from that first page twice. Articles with two pages of comments work beautifully, however. So my question relates to the syntax of if conditions and try statements. The Web-Harvest documentation is scant with regard to these functions.
Here's what I'm trying. First, the if test:
<var-def name="secondPageLink">
    <xpath expression="/a[@class='next']/@href">
        <var name="firstPage"/>
    </xpath>
</var-def>
<case>
    <if condition="${secondPageLink != null}">
        [ process second page ]
    </if>
</case>
Second, the try/catch:
<try>
    <body>
        <var-def name="secondPageLink">
            <xpath expression="/a[@class='next']/@href">
                <var name="firstPage"/>
            </xpath>
        </var-def>
        [ continue to process page ]
    </body>
    <catch>
    </catch>
</try>
The problem with the if test is that, even though the variable is empty when no second page exists (which I can see from the debugging output in the GUI), the condition seems to evaluate to true, and the if body runs.
I can more easily see why the try/catch doesn't work properly: an XPath expression that returns no value (when the second page doesn't exist) doesn't constitute an error as such, so the try body still succeeds. A further difficulty is that the @href of the next-page link is relative, so it needs to be appended to the URL of the first page (or the base URL of the article, actually, but it's the same thing here). This means my html-to-xml takes the URL ${firstPage}${secondPageLink}, which, when secondPageLink is empty, is simply the first page URL again, so Web-Harvest processes the first page a second time.
If someone can reformulate my if test to return false when the secondPageLink XPath returns an empty value, I'd be very appreciative!
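For reference, here is a minimal sketch of the kind of reformulation I have in mind. It has not been verified against this setup. It assumes that Web-Harvest wraps every variable in a variable object, so that an empty XPath result is an empty wrapper rather than null, and that calling toString() on it yields an empty string. The variable name articleUrl is hypothetical (it stands for the URL the first page was fetched from), and sys.fullUrl is assumed to be the built-in helper for resolving a relative link against a page URL.

```xml
<!-- Sketch only. Assumes firstPage holds the parsed first page and
     articleUrl (hypothetical name) holds the URL it was fetched from. -->
<var-def name="secondPageLink">
    <xpath expression="/a[@class='next']/@href">
        <var name="firstPage"/>
    </xpath>
</var-def>

<case>
    <!-- Test the length of the string value instead of comparing with null;
         the assumption is that an empty XPath result stringifies to "". -->
    <if condition="${secondPageLink.toString().length() != 0}">
        <var-def name="secondPage">
            <html-to-xml>
                <!-- Resolve the relative href against the article URL
                     (sys.fullUrl is assumed to take page URL, then link). -->
                <http url="${sys.fullUrl(articleUrl.toString(), secondPageLink.toString())}"/>
            </html-to-xml>
        </var-def>
        <!-- process second page here -->
    </if>
</case>
```

If that assumption about empty variables holds, the if body would be skipped for single-page articles, and the first page would no longer be fetched a second time.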