
I am trying to parse a website with HtmlUnit and Jsoup, and I am facing the following problem. I have several pages to parse, and I stored the links to these pages in a String array. I want to loop over the array and parse each page, and I proceed as follows:

1) For loop over the length of the links array
2) Open a new WebClient
3) Create a new HtmlPage from the link with the getPage method
4) Parse the page and get some elements
5) Close the WebClient
6) Go back to 2)

This way I obtain what I want, but the code is a little slow. So I tried opening and closing the WebClient outside the for loop, like this:

1) Open a new WebClient
2) For loop over the length of the links array
3) Create a new HtmlPage from the link with the getPage method
4) Parse the page and get some elements
5) Go back to 2)
6) Close the WebClient

It is much faster, but I'm not obtaining the same results as with the previous way.

Is it wrong to use the WebClient constructor in this way?

EDIT: Here is the code I'm testing:

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // TODO Auto-generated method stub
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        String[] links = {"http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6",
                            "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;9"};

        String bm = null;
        String[] odds = new String[2]; 

        //Second way
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        System.out.println("Client opened");
        for (int i=0; i<links.length; i++) {

            HtmlPage page = webClient.getPage(links[i]);
            System.out.println("Page loaded");
            Document csDoc = Jsoup.parse(page.asXml());
            System.out.println("Page parsed");

            Element table = csDoc.select("table.table-main.detail-odds.sortable").first();
            Elements cols = table.select("td:eq(0)");

            if (cols.first().text().trim().contains("bet365.it")) {
                bm = cols.first().text().trim();
                odds[i]=table.select("tbody > tr.lo").select("td.right.odds").first().text().trim();
            }
            else {
                Elements footTable = csDoc.select("table.table-main.detail-odds.sortable");
                Elements footRow = footTable.select("tfoot > tr.aver");
                odds[i] = footRow.select("td.right").text().trim();

                bm = "AVG";
            }
            webClient.close();  
        }

        System.out.println(bm +"\t" +odds[0] + "\t" + odds[1]);

    }

If I run this code, the results are right. If I move webClient.close(); outside the for loop, the results are not correct: in particular, odds[0] is equal to odds[1].

  • `but i'm not obtaining same results of previous way` is too general; please be more precise, in what way the results are different. – Frederic Klein Nov 24 '16 at 11:31
  • Are you getting any `Exception`(s)? – Shyam Baitmangalkar Nov 24 '16 at 11:38
  • @FredericKlein In the for loop I'm getting some data that I store in the odds array. If I run the code I posted, all elements of the array have the same value, while if I run the same code but create the WebClient inside the for loop, the elements of the array are all different (as they should be). Example: 1) with the WebClient inside the for loop I obtain (for example) odds[0] = 4.00 odds[1] = 3.00 odds[2] = 5.50 odds[3] = 7.50 2) with the WebClient outside the for loop I obtain (for example) odds[0] = 4.00 odds[1] = 4.00 odds[2] = 4.50 odds[3] = 4.00 It seems like it's not loading the right page. – Lorenzo Dusty Costa Nov 24 '16 at 13:17
  • @ShyamBaitmangalkar No, the code is working. The odds array just contains wrong values, as I explained in the comment above. – Lorenzo Dusty Costa Nov 24 '16 at 13:38

1 Answer


Think of WebClient as a replacement for your browser. Creating a new WebClient is like starting a new browser. If you want the equivalent of opening a new tab in your browser, you can use WebClient#openWindow(..). From a memory point of view, it is a good idea to close the window when you are done.
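One detail of the browser analogy may be worth checking here; it is an assumption, not something verified against the site. The two links in the question differ only in their fragment (the part after `#`). In a real browser, changing only the fragment of the current address does not send a new request; the page's own JavaScript is expected to react to the change. If a reused WebClient behaves the same way, that would be consistent with odds[0] being equal to odds[1]. The sketch below uses only `java.net.URI` to show what actually differs between the two links; the class name `FragmentCheck` and its helper methods are hypothetical.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical helper: compares two links and reports whether they
// point to the same resource and differ only in the "#..." fragment.
final class FragmentCheck {

    // Everything before the first '#', i.e. the part that is sent to the server.
    static String withoutFragment(String link) {
        int hash = link.indexOf('#');
        return hash < 0 ? link : link.substring(0, hash);
    }

    // The fragment as parsed by java.net.URI, or null if there is none.
    static String fragment(String link) {
        return URI.create(link).getFragment();
    }

    static boolean differOnlyInFragment(String a, String b) {
        return withoutFragment(a).equals(withoutFragment(b))
                && !Objects.equals(fragment(a), fragment(b));
    }

    public static void main(String[] args) {
        // The two links from the question.
        String first = "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6";
        String second = "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;9";

        System.out.println("Same page, different fragment: " + differOnlyInFragment(first, second));
        System.out.println("Fragments: " + fragment(first) + " vs " + fragment(second));
    }
}
```

If the check is true for your links, comparing the behaviour of a fresh WebClient and a reused one on exactly those two links is a cheap way to confirm or rule out this explanation.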

If you are looking for performance, why re-parse the whole page with Jsoup? HtmlUnit retrieves the page, parses it, builds the whole DOM, and runs the JavaScript on top of that DOM before you get the page back from your getPage call. You then use HtmlUnit to serialize the DOM tree back to HTML and use Jsoup to parse the page again. HtmlUnit offers many ways to search for elements on a page; I suggest using this API directly on the page you got.
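As a rough illustration of querying the page directly, here is a minimal sketch that reads the average-odds cell with HtmlUnit's own CSS selector support instead of serializing to XML and re-parsing with Jsoup. It assumes the `com.gargoylesoftware.htmlunit` packages used in the question and reuses the question's selectors; the class name `DirectHtmlUnitLookup` and the method `averageOdds` are hypothetical, and the result has not been checked against the live site.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;

// Hypothetical example class: looks up one cell on the already-built HtmlUnit DOM.
final class DirectHtmlUnitLookup {

    // Returns the text of the first matching average-odds cell, or null if none is found.
    // Note: unlike the Jsoup call in the question, this takes only the first match.
    static String averageOdds(HtmlPage page) {
        DomNode cell = page.querySelector(
                "table.table-main.detail-odds.sortable tfoot > tr.aver td.right");
        return cell == null ? null : cell.getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        String link = "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6";

        // try-with-resources closes the WebClient even if getPage throws.
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            HtmlPage page = webClient.getPage(link);
            System.out.println("AVG\t" + averageOdds(page));
        }
    }
}
```

The null check matters because the selector may match nothing, for example when the table has not been rendered yet; the code in the question would throw a NullPointerException in that case.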

RBRi
  • Thanks for your answer, especially for the suggestion about doing everything with HtmlUnit. I will look into the API more deeply. Regarding WebClient: I was thinking of it exactly as a browser, and that is why the "second way" came to mind. If I'm running a browser, have finished looking at a page, and need to go to a new one, I just change the address instead of closing the browser and opening it again or opening a new tab. – Lorenzo Dusty Costa Nov 24 '16 at 16:33