
I'm working on a project where I need to harvest some data from a website, so I'm using Web-Harvest.

I'm running into a problem where the data I'm harvesting (comments from news websites) is sometimes spread across more than one page. I'm trying to configure the script to look for the link to the second page of comments using an XPath expression on the first page. The problem is that if I use an if test, the condition always passes, and if I use a try statement, the try body always succeeds. As a result, for articles with only one page of comments, my script extracts the first page twice. Articles with two pages of comments work beautifully, however. So my question is about the syntax of if conditions and try statements. The Web-Harvest documentation is scant with regard to these elements.

Here's what I'm trying. First, the if test:

<var-def name="secondPageLink">
    <xpath expression="/a[@class='next']/@href">
        <var name="firstPage"/>
    </xpath>
</var-def>
<case>
    <if condition="${secondPageLink != null}">
        [ process second page ]
    </if>
</case>

Second, the try/catch:

<try>
    <body>
        <var-def name="secondPageLink">
            <xpath expression="/a[@class='next']/@href">
                <var name="firstPage"/>
            </xpath>
        </var-def>
        [ continue to process page ]
    </body>
    <catch>
    </catch>
</try>

The problem with the if test is that although the variable is empty when no second page exists (which I can see from the debugging in the GUI), the condition still seems to evaluate to true, and the body runs.

I can more easily see why the try/catch doesn't work: an XPath that returns no value (because the second page doesn't exist) isn't an 'error' as such, so the try still succeeds. A further difficulty is that the @href of the next-page link is relative, so it needs to be appended to the URL of the first page (or the base URL of the article, actually, but they are the same here). My html-to-xml therefore takes the URL ${firstPage}${secondPageLink}, which, when secondPageLink is empty, is simply the first page URL again, and Web-Harvest processes the first page a second time.

If someone can reformulate my if test to return false when the secondPageLink xpath returns an empty value, I'd be very appreciative!

Jangari
I've also tried testing for the exact string that I expect secondPageLink to hold: `condition="${secondPageLink == '?page=2'}"`. However, this never evaluates to true, so my two-page articles return only the first page. – Jangari Jul 17 '14 at 02:26

1 Answer


Found an answer.

Someone else had a similar problem with if, and an answer there suggested using the syntax: condition="${variable.toString().length() > 0}".

So in my code, replacing the if test with:

<case>
    <if condition="${secondPageLink.toString().length() > 0}">
        <var-def name="secondPageFull">
            <html-to-xml>
                <http url="${commentedArticleURL}${secondPageLink}"/>
            </html-to-xml>
[...]                   

produced the correct result.
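
For anyone who wants to see how the pieces fit together, here is a minimal sketch of the whole flow, not my exact configuration. The example.com URL is a placeholder, and the surrounding config element and the step that fetches firstPage are assumptions filled in for illustration; the variable names and the XPath expression are the ones used above. My understanding, which I haven't confirmed in the documentation, is that var-def always creates a variable object, so the variable is never null even when the XPath matches nothing, which would explain why the null check always passed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Placeholder URL (assumption): substitute the real article address -->
    <var-def name="commentedArticleURL">http://example.com/article</var-def>

    <!-- Fetch the first page of comments and clean it into XML -->
    <var-def name="firstPage">
        <html-to-xml>
            <http url="${commentedArticleURL}"/>
        </html-to-xml>
    </var-def>

    <!-- Ends up empty, but still defined, when there is no "next" link -->
    <var-def name="secondPageLink">
        <xpath expression="/a[@class='next']/@href">
            <var name="firstPage"/>
        </xpath>
    </var-def>

    <case>
        <!-- Test the length of the string value instead of comparing with null -->
        <if condition="${secondPageLink.toString().length() > 0}">
            <var-def name="secondPageFull">
                <html-to-xml>
                    <http url="${commentedArticleURL}${secondPageLink}"/>
                </html-to-xml>
            </var-def>
            <!-- Extract the comments from secondPageFull here -->
        </if>
    </case>
</config>
```

With this arrangement the second request is only made when the link text is non-empty, so a single-page article is fetched once.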
