
I am trying to scrape some websites using HtmlUnit 2.16. The sites' content is fairly heavy and spread across roughly 5000 pages. I run into a Java heap space error after some pages have been scraped, even though I have allocated -Xms1500m and -Xmx3000m; after running for 30 to 45 minutes it throws java.lang.OutOfMemoryError. Here is my example:

try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);
    webClient.setCssErrorHandler(new SilentCssErrorHandler());
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());

    // Get the first page's data
    HtmlPage currentPage = webClient.getPage("http://www.example.com");

    for (int i = 0; i < 5000; i++) {
        try {
            HtmlElement next = (HtmlElement) currentPage
                .getByXPath("//span[contains(text(),'Next')]")
                .get(0);

            currentPage = next.click();
            webClient.waitForBackgroundJavaScript(10000);
            System.out.println("Got data: " + currentPage.asXml());
        } catch (Exception e) {
            e.printStackTrace(System.err);
        }
    }
} catch (Exception e) {
    e.printStackTrace(System.err);
}

As you can see, I click the Next button to get each page's content. I also close the WebClient (webClient.close() is handled by the try-with-resources block). Has anyone faced a similar issue? Does HtmlUnit have a memory leak?

  • @SeanBright Sorry about typo :) – Sthita Oct 27 '16 at 14:27
  • @SeanBright Thanks for the edit, but we need to add finally { webClient.close(); }. This is really important. – Sthita Oct 27 '16 at 14:33
  • It's handled for you automatically using [try-with-resources](https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html) – Sean Bright Oct 27 '16 at 14:33
  • One observation here. May not be relevant to the OOM problem though. The try-catch within the for loop does not update the loop on exceptions. Imagine `next.click()` fails with an exception and `currentPage` not updated. – neurite Oct 27 '16 at 14:49
  • @neurite Yes, you are right: currentPage will not be updated if next.click() fails (a sketch of the guard follows below). We can optimize the code, but the issue I am facing here is during scraping. – Sthita Oct 27 '16 at 14:53
  • Is this issue resolved? I am getting the same OutOfMemoryError: Java heap space with version 2.17 – Shamim Ahmad May 09 '18 at 14:26
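
For reference, here is a minimal sketch of the guard discussed in the comments above. It rewrites the body of the question's for loop (it is not the original code) so the loop stops instead of silently reusing a stale currentPage when the Next element is missing or the click throws:

    // Guarded loop body: look up the Next element first and stop if it is gone
    // or if clicking it fails, instead of continuing with a stale currentPage.
    HtmlElement next = currentPage.getFirstByXPath("//span[contains(text(),'Next')]");
    if (next == null) {
        break; // no Next button, last page reached
    }
    try {
        currentPage = next.click();
        webClient.waitForBackgroundJavaScript(10000);
        System.out.println("Got data: " + currentPage.asXml());
    } catch (IOException e) {
        e.printStackTrace(System.err);
        break; // do not keep clicking on a page that failed to advance
    }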

2 Answers


Maybe the problem is that all the pages are still stored in the history.

I disable the browsing history this way:

    try {
        // Use reflection to set the private ThreadLocal field History.ignoreNewPages_
        // so that newly loaded pages are not kept in the window's history.
        final History history = webClient.getWebWindows().get(0).getHistory();
        final Field f = history.getClass().getDeclaredField("ignoreNewPages_"); // throws NoSuchFieldException
        f.setAccessible(true);
        ((ThreadLocal<Boolean>) f.get(history)).set(Boolean.TRUE);
        LOGGER.debug("_dbff772d4d_ disabled history of WebClient");
    }
    catch (final Exception e) {
        LOGGER.warn("_66461112f7_ Can't disable history of WebClient");
    }

I got the idea from how-to-limit-htmlunits-history-size


These configurations are not related to your problem, but were useful in my projects:

    // The constants below are my own project settings, not HtmlUnit defaults.
    webClient.setJavaScriptTimeout(JAVASCRIPT_TIMEOUT);
    webClient.getOptions().setTimeout(WEB_TIMEOUT);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setPopupBlockerEnabled(true);
    webClient.setRefreshHandler(new WaitingRefreshHandler(REFRESH_HANDLER_WAIT_LIMIT));

Please try the latest version of HtmlUnit. We have fixed many memory issues in between. At least 2.23 has some fixes regarding the history. Additionally, you can now control the history size.
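
For example, in a recent release this might look like the following sketch. The setHistorySizeLimit call is an assumption on my part; it was added to WebClientOptions in versions after the one in the question, so check the Javadoc of the version you upgrade to:

    try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
        // Assumption: setHistorySizeLimit is available in your HtmlUnit version;
        // a small value keeps old pages out of the history so they can be GC'd.
        webClient.getOptions().setHistorySizeLimit(1);
        // ... run the scraping loop as in the question ...
    }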
