
I want to gather information from a set of web pages that all share nearly the same format. Some of the information I need is loaded onto the page by JavaScript after the page opens. HtmlUnit seems to be a common tool for this, so that's what I'm using. Unfortunately it is very slow, which is a complaint I've seen on a lot of forums. The webClient.getPage() call is what takes forever. With JavaScript turned off it runs quickly, but I need some of the page's JavaScript to run. Is there a way to selectively execute a few JavaScript commands instead of all of them?

Alternatively, is there a tool that processes JavaScript much faster than HtmlUnit?

Sam Bobel

1 Answer


Sort of. You can programmatically decide which external JavaScript URLs to load.

HtmlUnit will run all JS embedded on the page, if JavaScript is enabled. However, if certain external URLs are not required, you can choose to not load them.

Here's some code to get you started:

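    // Needs: java.io.IOException, com.gargoylesoftware.htmlunit.WebRequest,
    // com.gargoylesoftware.htmlunit.WebResponse,
    // com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection
    webClient.setWebConnection(new FalsifyingWebConnection(webClient) {
        @Override
        public WebResponse getResponse(WebRequest request) throws IOException {

            // Placeholder: replace with the lower-case path of a script you don't need.
            // getPath() returns only the path part of the URL, so compare against a path.
            if (request.getUrl().getPath().toLowerCase().equals("some url i don't need")) {
                // Answer with an empty script instead of fetching the real one.
                return createWebResponse(request, "", "application/javascript");
            }

            // Everything else is loaded normally.
            return super.getResponse(request);
        }
    });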
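
If more than one script needs to be skipped, the path check above could be moved into a small helper. The following is only a sketch: `ScriptBlocklist`, `shouldBlock` and the two paths are hypothetical names made up for illustration, not part of HtmlUnit, and the sketch assumes Java 9+ for `Set.of`.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of HtmlUnit): decides whether a request path
// belongs to a script that should be replaced by an empty response.
final class ScriptBlocklist {

    // Placeholder paths, stored in lower case; replace with scripts your pages don't need.
    private static final Set<String> BLOCKED_PATHS = Set.of(
            "/js/analytics.js",
            "/js/ads.js");

    private ScriptBlocklist() { }

    // Returns true if the given URL path matches a blocked path, ignoring case.
    static boolean shouldBlock(String path) {
        return BLOCKED_PATHS.contains(path.toLowerCase(Locale.ROOT));
    }
}
```

Inside `getResponse`, the condition would then read `if (ScriptBlocklist.shouldBlock(request.getUrl().getPath()))`, with the rest of the override unchanged.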

The following settings might speed things up too:

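    // Needs: java.util.logging.Level, com.gargoylesoftware.htmlunit.SilentCssErrorHandler,
    // com.gargoylesoftware.htmlunit.IncorrectnessListener

    // Turn off HtmlUnit's logging.
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);

    // Ignore CSS errors.
    webClient.setCssErrorHandler(new SilentCssErrorHandler());

    // Ignore notifications about incorrect markup or headers.
    webClient.setIncorrectnessListener(new IncorrectnessListener() {
        @Override
        public void notify(String s, Object o) { }
    });

    // Skip cookies and CSS, and don't throw or print on failing status codes or script errors.
    webClient.getCookieManager().setCookiesEnabled(false);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);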
Neil McGuigan
  • Thanks, I just tried the second part, and it did help a bit. I'll try the first part tomorrow and see how it goes. Any thoughts on other tools to use? It's my understanding that this is a testing kit and therefore works slowly in order to deal well with malformed code. Since I'm using it on websites that clearly work, is there a faster tool you know of? – Sam Bobel May 05 '14 at 22:12
  • @user3598519 You could try PhantomJS too. It's pretty fast. HtmlUnit is a bit more robust though. – Neil McGuigan May 05 '14 at 22:32
  • What about using Node.js? I just started reading about it. It looks fast, but it may have limitations in functionality that I don't know about. For the task of loading a web page, running a JavaScript command from the page, and collecting the results, would Node.js be a faster alternative? – Sam Bobel May 05 '14 at 23:08
  • @SamBobel I don't know enough about Node to comment. – Neil McGuigan May 05 '14 at 23:31
  • I also faced the same issue – Shashank Sep 24 '15 at 04:40