I am trying to scrape some websites using HtmlUnit 2.16. The content of the sites is fairly heavy and there are around 5000 pages to walk through. After a number of pages have been scraped I run into a Java heap space problem. I have allocated -Xms1500m and -Xmx3000m, but after running for 30-45 minutes it still throws java.lang.OutOfMemoryError. Here is my example:
try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);
    webClient.setCssErrorHandler(new SilentCssErrorHandler());
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());

    // Get 1st page data
    HtmlPage currentPage = webClient.getPage("http://www.example.com");

    for (int i = 0; i < 5000; i++) {
        try {
            // Find the "Next" control and click it to load the following page
            HtmlElement next = (HtmlElement) currentPage
                    .getByXPath("//span[contains(text(),'Next')]")
                    .get(0);
            currentPage = next.click();
            webClient.waitForBackgroundJavaScript(10000);
            System.out.println("Got data: " + currentPage.asXml());
        } catch (Exception e) {
            e.printStackTrace(System.err);
        }
    }
} catch (Exception e) {
    e.printStackTrace(System.err);
}
As you can see, I click the Next button to get each page's content. I also have webClient.close() (the try-with-resources block takes care of it). Has anyone faced a similar kind of issue? Does HtmlUnit have a memory leak?
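For what it's worth, this is the kind of per-page cleanup I am considering adding inside the loop. It is only a sketch: I am not sure setHistorySizeLimit() is already available in 2.16, and I have not verified that any of these calls actually fix the growth, so please treat the specific calls as assumptions rather than a known solution:

    // Sketch: possible per-page cleanup after processing currentPage (not verified)
    webClient.getOptions().setHistorySizeLimit(0);                // don't keep every visited page in the window history (may not exist in 2.16)
    webClient.getCache().clear();                                 // drop cached responses (JS/CSS) accumulated so far
    webClient.getCurrentWindow().getJobManager().removeAllJobs(); // cancel leftover background JavaScript jobs
    webClient.getCookieManager().clearCookies();                  // optional: discard cookies if they are not needed

Would cleanup along these lines be expected to help, or is the memory held somewhere else entirely?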