
So I have an Ember app and I need to take a snapshot of it for crawling purposes. The Ember app uses the Google+ API for signing in. It also has a YouTube video embedded in the index page. I use HtmlUnit v2.15.

I'm using the following code to initialize HtmlUnit:

// use the headless browser to obtain an HTML snapshot
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setActiveXNative(true);
webClient.getOptions().setAppletEnabled(true);
webClient.getOptions().setCssEnabled(true);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(originalUrl);

// important!  Give the headless browser enough time to execute JavaScript
// The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(5000);

// return the snapshot
logger.info("Writing snapshot for URL: " + originalUrl);
response.getWriter().write(page.asXml());
webClient.closeAllWindows();
  1. Now, I have one issue that happens with all 3 major browser versions (CHROME, INTERNET_EXPLORER_11, FIREFOX_24):

    runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).] sourceName=[http://www.domain.com/assets/vendor.js] line=[1351] lineSource=[null] lineOffset=[0]
    

Snippet from vendor.js:

// Opera 10-11 does not throw on post-comma invalid pseudos
div.querySelectorAll("*,:x"); // line 1351 is the problem
rbuggyQSA.push(",.*:");
  2. Then, I have the following type of error only with FIREFOX_24 and INTERNET_EXPLORER_11:

    Invalid rpc message origin. https://accounts.google.com vs http://www.domain.com
    
    Invalid rpc message origin. https://apis.google.com vs http://www.domain.com
    
  3. This happens only in INTERNET_EXPLORER_11:

    runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[https://s.ytimg.com/yts/jsbin/www-embed-player-vflWiCusa/www-embed-player.js] line=[59] lineSource=[null] lineOffset=[0]
    
  4. Lastly, this happens only in CHROME:

    [com.gargoylesoftware.htmlunit.javascript.host.xml.XMLHttpRequest.open(XMLHttpRequest.java:534)]Unable to initialize XMLHttpRequest using malformed URL 'chrome-extension://boadgeojelhgndaghljhdicfkmllpafd/cast_sender.js'.
    

Also, if I open the result of the HtmlUnit processing in a web browser (Chrome on Linux in this case), the page is not rendered; I just get:

This page contains the following errors:

error on line 23 at column 5: Encoding error
Below is a rendering of the page up to the first error.

embed[type*="application/x-shockwave-flash"],embed[src*=".swf"],object[type*="application/x-shockwave-flash"],object[codetype*="application/x-shockwave-flash"],object[src*=".swf"],object[codebase*="swflash.cab"],object[classid*="D27CDB6E-AE6D-11cf-96B8-444553540000"],object[classid*="d27cdb6e-ae6d-11cf-96b8-444553540000"],object[classid*="D27CDB6E-AE6D-11cf-96B8-444553540000"]{    display: none !important;}

UPDATE:

I just updated HtmlUnit to v2.16.

The page not rendering at all was caused partly by the Flash plugin integration, which appears to be fixed in v2.16 (see below), and partly by a non-UTF-8 character in the index page, so that part was my bad. The page now renders as expected. Still, some parsing issues remain, as explained below.

  1. Not fixed using CHROME or FIREFOX_31. Fixed in INTERNET_EXPLORER_11
  2. Not fixed. Now present also in CHROME, besides FIREFOX_31. Fixed in IE_11.
  3. Fixed in IE_11.
  4. Fixed in CHROME.
  5. New issue with CHROME, FIREFOX_31:

Rhino runtime detected object com.gargoylesoftware.htmlunit.ScriptException: Exception invoking resolve of class com.gargoylesoftware.htmlunit.ScriptException where it expected String, Number, Boolean or Scriptable instance. Please check your code for missing Context.javaToJS() call.

  6. New issue with IE_11:

runtimeError: message=[An invalid or illegal selector was specified (selector: ':enabled' error: Syntax Error).] sourceName=[http://www.domain.com/assets/vendor.js] line=[1346] lineSource=[null] lineOffset=[0]

Snippet at line 1346:

// FF 3.5 - :enabled/:disabled and hidden elements (hidden elements are still enabled)
            // IE8 throws error here and will not see later tests
            if ( !div.querySelectorAll(":enabled").length ) {
                rbuggyQSA.push( ":enabled", ":disabled" );
            }

In conclusion, with the latest version, HtmlUnit v2.16, IE_11 has only 1 error, while CHROME and FIREFOX_31 have 3. As a result, I will switch to IE_11 and also change the log threshold for HtmlUnit from ERROR to FATAL so that we are not spammed with error emails from that 1 issue. It's better, I'll give you that, but still not perfect. Maybe with next year's update? :)

Bogdan Zurac
  • Downvote? Really? For what? :| – Bogdan Zurac Apr 17 '15 at 09:53
  • #1 is "ok": http://stackoverflow.com/questions/15145108/running-jquery-crashing-on-ie10-win7 It's a handled exception inside jquery. – wholevinski Apr 22 '15 at 12:36
  • Then is there any way to mark it as a warning instead of an error from HtmlUnit? – Bogdan Zurac Apr 22 '15 at 13:13
  • webClient.getOptions().setThrowExceptionOnScriptError(false); – wholevinski Apr 22 '15 at 13:19
  • But then I wouldn't get errors that are actual errors which prevent execution, right? Also, it's already set as false, so doesn't change anything.. – Bogdan Zurac Apr 22 '15 at 13:20
  • That's true, but it looks like the behavior of HtmlUnit is to throw on _any_ exception. The way that jquery handles some things (like #1) is it's a caught exception, so execution doesn't stop. (I don't necessarily agree with this, and neither does this guy: http://bugs.jquery.com/ticket/14123) – wholevinski Apr 22 '15 at 13:58
  • What happens if you increase your "waitForBackgroundJavaScript" btw? Any change in/less errors? – wholevinski Apr 22 '15 at 13:59
  • I agree with what you've said. But I still need to find a way to suppress issues like this, as we have an email handler set on errors, and we get spammed every day with emails about the above-mentioned issues. Regarding waitForBackgroundJavaScript, I've had it set to 20 seconds or more, from what I recall, and it didn't make any difference. – Bogdan Zurac Apr 22 '15 at 14:01
  • Can you provide the URL of the website? – Ahmed Ashour Apr 27 '15 at 06:43
  • The project is not live yet. Would it make any difference? As I already provided all the details regarding the issues. – Bogdan Zurac May 01 '15 at 09:54

1 Answer


In order to fix the majority of issues mentioned above, update HtmlUnit to v2.16 and set the browser version to INTERNET_EXPLORER_11. In my case only 1 error remained. To get rid of this error from our mailer logs, I have set the log level threshold to FATAL instead of ERROR. In order to do this, add the below line in the log4j.properties file.

log4j.logger.com.gargoylesoftware.htmlunit=FATAL
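
Note that the FATAL threshold hides every HtmlUnit error, including ones that might matter. A narrower alternative, sketched below, is to recognise only the messages already known to be harmless. This is a minimal sketch, not an HtmlUnit or log4j API: the class `KnownHtmlUnitNoise` and its method `isBenign` are hypothetical names, and the two fragments are copied from the selector errors quoted in the question. Treating the `:enabled` probe as harmless is an assumption based on it sitting in the same `rbuggyQSA` feature-detection code as the `*,:x` probe.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of HtmlUnit or log4j). */
final class KnownHtmlUnitNoise {

    // Fragments of the selector feature-probe errors quoted in the question.
    // Assumption: both are caught inside the library and do not stop execution.
    private static final List<String> BENIGN_FRAGMENTS = Arrays.asList(
            "An invalid or illegal selector was specified (selector: '*,:x'",
            "An invalid or illegal selector was specified (selector: ':enabled'");

    private KnownHtmlUnitNoise() {
    }

    /** Returns true if the log message contains one of the known harmless fragments. */
    static boolean isBenign(String message) {
        if (message == null) {
            return false;
        }
        for (String fragment : BENIGN_FRAGMENTS) {
            if (message.contains(fragment)) {
                return true;
            }
        }
        return false;
    }
}
```

A check like this could be called from a custom filter on the mailer appender, so that only unrecognised errors trigger an email while the logger threshold stays at ERROR.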

Also double-check that all characters in the resulting XML are valid UTF-8.
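
One way to run that check is with a strict decoder from the JDK, which rejects malformed input instead of silently replacing it. The sketch below assumes you have the raw bytes of the index page (or of the snapshot as it is served); `Utf8Check` and `isValidUtf8` are hypothetical names used only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of HtmlUnit). */
final class Utf8Check {

    private Utf8Check() {
    }

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the bytes of a file read with `Files.readAllBytes` can be passed straight to `isValidUtf8`; a file saved as ISO-8859-1 that contains accented characters would typically fail the check.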

Bogdan Zurac