
I am trying to parse this page.

http://www.reuters.com/article/2015/07/08/us-china-cybersecurity-idUSKCN0PI09020150708

My code looks like this:

  WebClient webClient = new WebClient(BrowserVersion.CHROME);
  final HtmlPage page = webClient.getPage("http://www.reuters.com/article/2015/07/08/us-alibaba-singapore-post-idUSKCN0PI03J20150708");
  System.out.println(page.asXml());

It produces a lot of warnings and a huge stack trace, mostly related to the JavaScript engine. I have tried these options:

webClient.waitForBackgroundJavaScript(1000000);
webClient.setJavaScriptTimeout(1000000);

But nothing seems to work. This page executes JavaScript to load its content, so I need to wait for the page to finish loading before I can read the content. Any ideas how I can resolve this issue?

Ahmed Ashour
Mark

1 Answer


You need to wait just after getting the page. There is also a script error, "addImpression" is not defined; I don't know which JavaScript file is supposed to define it.

I suspect you are not using a recent version, since the latest one does not produce many warnings.

With the latest snapshot, I get the content using:

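import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    // Keep going when a page script fails (e.g. the "addImpression" error)
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    final HtmlPage page = webClient.getPage("http://www.reuters.com/article/2015/07/08/us-alibaba-singapore-post-idUSKCN0PI03J20150708");
    // Wait up to 10 seconds for background JavaScript, after the page has been fetched
    webClient.waitForBackgroundJavaScript(10000);
    System.out.println(page.asText());
}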
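
If a single fixed wait turns out to be too short or wastefully long, one option is to poll in small steps until the content you expect shows up. Below is a minimal sketch of that idea using only the JDK. The class `PollingWait` and its method `waitUntil` are hypothetical names for illustration, not part of HtmlUnit, and the "page content" in `main` is simulated with a background thread.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not an HtmlUnit class): polls a condition until it holds or a deadline passes. */
final class PollingWait {

    private PollingWait() {
    }

    /**
     * Returns true as soon as the condition holds. Returns false if timeoutMillis
     * elapses first or the calling thread is interrupted. pollMillis must be >= 0.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                // Assumption: with HtmlUnit you could call
                // webClient.waitForBackgroundJavaScript(pollMillis) here instead of sleeping.
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for page content that background JavaScript fills in later.
        final StringBuffer content = new StringBuffer();
        Thread loader = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            content.append("article body");
        });
        loader.start();

        // Poll every 50 ms, give up after 5 seconds.
        boolean loaded = waitUntil(() -> content.length() > 0, 5_000, 50);
        System.out.println("loaded=" + loaded);
    }
}
```

With a real page, the condition might be something like `() -> page.asText().contains("some text you expect")`; the text to look for is up to you and is only an assumption here.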
Ahmed Ashour