5

Why is HtmlUnit so much slower than GUI browsers? For instance, HtmlUnit loads this page http://oltexpress.airkiosk.com/cgi-bin/airkiosk/I7/181002i?O2=2 in 14 seconds (with CSS support turned off), while Firefox does it in 5 seconds (after clearing the cache, with CSS support on). I know that modern browsers are more forgiving of bad JS code than HtmlUnit is, but the time difference here is still intolerable.

Any ideas on how to speed up HtmlUnit? Has anyone experimented with the HtmlUnit cache?

biera

3 Answers

5

To answer your question on why it is slow:

This is purely because HtmlUnit has many things going against it:

  • It runs on Java, which does not have many of the native optimisations that browsers such as Firefox have.
  • It requires well-formed XML rather than (non-strict) HTML, which means it first has to convert the HTML into XML.
  • Then it has to run the JavaScript through a parser, fix any problems with the code, and process it inside Java itself.
  • Also, as @Arya pointed out, it requests things one at a time, so many JavaScript files or many images will result in a slowdown.

To answer your question on how to speed it up:

As a general rule I disable (unless they are needed):

  • JavaScript
  • Images
  • CSS
  • Applets.
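For reference, a minimal sketch of how these switches might be set. It assumes HtmlUnit 2.11 or later on the 2.x line, where settings live on `getOptions()` and the package is `com.gargoylesoftware.htmlunit`; older releases exposed setters directly on `WebClient`, and 3.x moved to `org.htmlunit`. The class name `LeanClient` is made up for illustration, and only the switches I am sure about are shown, so image handling is left out.

```java
// Assumes HtmlUnit 2.11+ (2.x package names); not checked against every release.
import com.gargoylesoftware.htmlunit.WebClient;

public class LeanClient {
    /** Creates a WebClient with the expensive features switched off. */
    public static WebClient create() {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(false); // skip script download and execution
        client.getOptions().setCssEnabled(false);        // skip stylesheet download and parsing
        client.getOptions().setAppletEnabled(false);     // do not load applets
        return client;
    }
}
```

Turn JavaScript back on only for the pages that genuinely need it, since that is usually the most expensive of the three.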

I also took the source code, removed the ActiveX support, and recompiled. If you want to stop HtmlUnit from loading those extra resources, you can use the code below to return a response without downloading anything from the web.

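// HtmlUnit 2.x package names (3.x uses org.htmlunit instead).
import java.io.IOException;
import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

WebClient browser = new WebClient();
browser.setWebConnection(new WebConnectionWrapper(browser) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        // Placeholder: replace with your own test, e.g. on request.getUrl().
        final boolean wanted = true;
        if (wanted) {
            return super.getResponse(request); // Pass the responsibility up.
        } else {
            // Give the program a response, but leave it empty.
            return new StringWebResponse("", request.getUrl());
        }
    }
});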
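
The test inside `getResponse` above is left open. One possible way to fill it in, as a rough sketch: decide by the file extension of the requested path. The class name `RequestFilter`, its methods, and the list of skipped extensions are all made up for illustration; adjust them to whatever your pages can live without.

```java
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class RequestFilter {
    // Hypothetical list of extensions that are not worth downloading.
    private static final Set<String> SKIPPED = new HashSet<>(
            Arrays.asList(".css", ".png", ".jpg", ".jpeg", ".gif", ".ico"));

    /** Returns true if the resource at this URL should really be fetched. */
    public static boolean shouldFetch(URL url) {
        return shouldFetchPath(url.getPath()); // getPath() excludes the query string
    }

    /** Same decision, based on the path part of the URL only. */
    public static boolean shouldFetchPath(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        int slash = lower.lastIndexOf('/');
        if (dot <= slash) {
            return true; // last path segment has no extension: treat it as a page
        }
        return !SKIPPED.contains(lower.substring(dot));
    }
}
```

Inside the wrapper, the condition would then be `RequestFilter.shouldFetch(request.getUrl())`. Matching on extensions is crude (it misses resources served from extension-less URLs), so treat it as a starting point.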

Other things I have noticed:

  • HtmlUnit is not thread-safe, meaning that you should probably create a new WebClient for each thread.
  • HtmlUnit does actually cache pages.
Opal
  • "It then has to parse the entire thing into objects, each tag being a separate object(Object creation is expensive)." & "Then it has to run the JavaScript through a parser, fix any problems with the code, then process that inside Java itself." - I am not sure, but I think a "normal" browser does that as well. Thanks for your answer! – biera Dec 11 '12 at 19:21
  • I see your point. I was just trying to point out that the language and the libraries are not specially designed for high intensity parsing and execution. – Opal Dec 11 '12 at 21:06
  • @Lee Is it possible to speed up the page fetch in HtmlUnit? `HtmlPage page1 = webClient.getPage(url);` – muthu Jun 19 '13 at 07:23
  • Basically, changing the settings on the web client should improve `.getPage`: ensuring that no extraneous requests are made, etc. However, you must remember that it is a testing library. The project I was working on relied on it heavily in the actual application. It was slow as anything, and I eventually moved to an API system that is now significantly faster. – Opal Jun 20 '13 at 11:01
1

The reason it takes longer with HtmlUnit is that requests are made one by one. That is the main reason why it takes so long to retrieve a page. JS and CSS should not make a big difference, IMO.
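
To put rough numbers on the one-at-a-time point, here is a back-of-the-envelope model. It is not HtmlUnit code and it measures nothing: the latencies are invented, and the class and method names are only for illustration. It compares the total time for fetching resources strictly in sequence with fetching them over several parallel connections, the way a browser typically does.

```java
import java.util.PriorityQueue;

public class FetchTimeModel {
    /** Total time when each resource is requested only after the previous one finishes. */
    public static long sequentialMillis(long[] latencies) {
        long total = 0;
        for (long latency : latencies) {
            total += latency;
        }
        return total;
    }

    /**
     * Total time with up to `connections` fetches in flight: each resource,
     * in order, goes to whichever connection becomes free first.
     */
    public static long parallelMillis(long[] latencies, int connections) {
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < connections; i++) {
            freeAt.add(0L); // every connection is free at time 0
        }
        long finish = 0;
        for (long latency : latencies) {
            long done = freeAt.poll() + latency;
            finish = Math.max(finish, done);
            freeAt.add(done);
        }
        return finish;
    }

    public static void main(String[] args) {
        long[] latencies = {400, 300, 300, 200, 200, 100}; // made-up values, in ms
        System.out.println("sequential:    " + sequentialMillis(latencies) + " ms");
        System.out.println("2 connections: " + parallelMillis(latencies, 2) + " ms");
        System.out.println("6 connections: " + parallelMillis(latencies, 6) + " ms");
    }
}
```

With these invented numbers the sequential total is 1500 ms, against 800 ms on two connections and 400 ms on six, so the gap grows with the number of resources a page pulls in.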

Arya
0

A WebClient object maintains a cache of static resources. If you close a WebClient object and create another one, the cache has to be rebuilt.

To avoid this, you can reuse the WebClient object across multiple sessions, or even maintain a pool of WebClient objects. Also see if you can maintain a Cache object. You may want to clear the WebClient's cookies before returning it to the pool.
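
A minimal sketch of such a pool, written generically so that it does not depend on HtmlUnit. `SimplePool` and its method names are invented for this example. The reset hook runs whenever an instance is handed back, which is where the cookie clearing would go.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class SimplePool<T> {
    private final BlockingQueue<T> idle;
    private final Consumer<T> reset;

    /** Pre-creates `size` instances; `reset` is applied to each instance on return. */
    public SimplePool(int size, Supplier<T> factory, Consumer<T> reset) {
        this.idle = new ArrayBlockingQueue<>(size);
        this.reset = reset;
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Blocks until an instance is free, then hands it to the caller. */
    public T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
    }

    /** Clears per-session state and puts the instance back into the pool. */
    public void giveBack(T instance) {
        reset.accept(instance);
        idle.add(instance); // throws if more instances are returned than were created
    }

    /** Number of instances currently waiting in the pool. */
    public int available() {
        return idle.size();
    }
}
```

With HtmlUnit this might be created as `new SimplePool<>(4, WebClient::new, c -> c.getCookieManager().clearCookies())`, assuming the no-argument `WebClient` constructor and the 2.x `CookieManager` API. Each borrowed client should be used by one thread at a time, which fits the earlier remark that HtmlUnit is not thread-safe.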

As @Lee pointed out, WebConnectionWrapper gives you an opportunity to intercept requests. I use it to avoid redirects, disable JS execution for selected resources, or return mock data when I do not care about a resource.

Paddy