So I have an Ember app and I need to take a snapshot of it for crawling purposes. The Ember app uses the Google+ API for signing in. It also has a YouTube video embedded in the index page. I use HtmlUnit v2.15.
I'm using the following code to initialize HtmlUnit:
// use the headless browser to obtain an HTML snapshot
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setActiveXNative(true);
webClient.getOptions().setAppletEnabled(true);
webClient.getOptions().setCssEnabled(true);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(originalUrl);
// important! Give the headless browser enough time to execute JavaScript
// The exact time to wait may depend on your application.
webClient.waitForBackgroundJavaScript(5000);
// return the snapshot
logger.info("Writing snapshot for URL: " + originalUrl);
response.getWriter().write(page.asXml());
webClient.closeAllWindows();
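As a side note on the fixed 5000 ms wait above: a minimal sketch of polling for a readiness condition instead of always sleeping for the full timeout. The class name `PollUntil`, the method `await`, and its parameters are made up for illustration and are not part of HtmlUnit; the condition itself (for example, a check that an element rendered by Ember is present in the page) has to be supplied by the caller.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not an HtmlUnit API.
final class PollUntil {
    private PollUntil() {}

    /**
     * Re-checks the condition every stepMillis until it returns true
     * or timeoutMillis has elapsed. Returns true if the condition was met.
     */
    static boolean await(BooleanSupplier ready, long timeoutMillis, long stepMillis)
            throws InterruptedException {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (ready.getAsBoolean()) {
                return true;            // condition met, stop waiting early
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;           // timed out without the condition holding
            }
            Thread.sleep(stepMillis);   // back off before the next check
        }
    }
}
```

With something like this, the snapshot could be written as soon as the condition holds, with the 5000 ms value acting only as an upper bound.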
Now, I have one issue that happens with all 3 major browser versions (CHROME, INTERNET_EXPLORER_11, FIREFOX_24):
runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).] sourceName=[http://www.domain.com/assets/vendor.js] line=[1351] lineSource=[null] lineOffset=[0]
Snippet from vendor.js:
// Opera 10-11 does not throw on post-comma invalid pseudos
div.querySelectorAll("*,:x"); // line 1351 is the problem
rbuggyQSA.push(",.*:");
Then, I have the following type of error only with FIREFOX_24 and INTERNET_EXPLORER_11:
Invalid rpc message origin. https://accounts.google.com vs http://www.domain.com
Invalid rpc message origin. https://apis.google.com vs http://www.domain.com
This happens only in INTERNET_EXPLORER_11:
runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[https://s.ytimg.com/yts/jsbin/www-embed-player-vflWiCusa/www-embed-player.js] line=[59] lineSource=[null] lineOffset=[0]
Lastly, this happens only in CHROME:
[com.gargoylesoftware.htmlunit.javascript.host.xml.XMLHttpRequest.open(XMLHttpRequest.java:534)]Unable to initialize XMLHttpRequest using malformed URL 'chrome-extension://boadgeojelhgndaghljhdicfkmllpafd/cast_sender.js'.
Also, if I want to check the result of HtmlUnit processing inside a web browser (Chrome on Linux in this case), the resulting page is not rendered; it's just:
This page contains the following errors:
error on line 23 at column 5: Encoding error
Below is a rendering of the page up to the first error.
embed[type*="application/x-shockwave-flash"],embed[src*=".swf"],object[type*="application/x-shockwave-flash"],object[codetype*="application/x-shockwave-flash"],object[src*=".swf"],object[codebase*="swflash.cab"],object[classid*="D27CDB6E-AE6D-11cf-96B8-444553540000"],object[classid*="d27cdb6e-ae6d-11cf-96b8-444553540000"],object[classid*="D27CDB6E-AE6D-11cf-96B8-444553540000"]{ display: none !important;}
UPDATE:
I just updated HtmlUnit to v2.16.
The page not rendering at all was caused partly by the Flash plugin integration, which appears to be fixed in v2.16 (see below), and partly by a non-UTF-8 character present in the index page, so that one was partly my fault. The page now renders as expected. Still, some parsing issues remain, as explained below.
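For anyone hitting the same "Encoding error", here is a rough sketch of how the offending byte in a page could be located with the JDK's strict UTF-8 decoder. The class name `Utf8Check` and the method `firstInvalidOffset` are made up for illustration; this is not something HtmlUnit provides.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for locating a non-UTF-8 byte sequence.
final class Utf8Check {
    private Utf8Check() {}

    /** Returns the byte offset of the first invalid UTF-8 sequence, or -1 if the input is valid. */
    static int firstInvalidOffset(byte[] bytes) {
        // REPORT makes the decoder stop on bad input instead of silently replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position();   // decoder stops at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1;              // all input consumed without errors
            }
            out.clear();                // output buffer full: discard and keep decoding
        }
    }
}
```

The bytes of the index page could be read with `java.nio.file.Files.readAllBytes` and passed to `firstInvalidOffset`; a result other than -1 points at the position to inspect.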
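- Invalid selector '*,:x' error: not fixed using CHROME or FIREFOX_31. Fixed in INTERNET_EXPLORER_11.
- "Invalid rpc message origin" error: not fixed. Now present also in CHROME, besides FIREFOX_31. Fixed in IE_11.
- ShockwaveFlash "Automation server can't create object" error: fixed in IE_11.
- Malformed 'chrome-extension://' URL error: fixed in CHROME.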
- New issue with CHROME, FIREFOX_31:
Rhino runtime detected object com.gargoylesoftware.htmlunit.ScriptException: Exception invoking resolve of class com.gargoylesoftware.htmlunit.ScriptException where it expected String, Number, Boolean or Scriptable instance. Please check your code for missing Context.javaToJS() call.
- New issue with IE_11:
runtimeError: message=[An invalid or illegal selector was specified (selector: ':enabled' error: Syntax Error).] sourceName=[http://www.domain.com/assets/vendor.js] line=[1346] lineSource=[null] lineOffset=[0]
Snippet at line 1346:
// FF 3.5 - :enabled/:disabled and hidden elements (hidden elements are still enabled)
// IE8 throws error here and will not see later tests
if ( !div.querySelectorAll(":enabled").length ) {
rbuggyQSA.push( ":enabled", ":disabled" );
}
In conclusion, in the latest version of HtmlUnit, v2.16, IE_11 has only 1 error, while CHROME and FIREFOX_31 have 3. As a result, I will switch to using IE_11 and also change the log threshold for HtmlUnit to FATAL instead of ERROR, so that I'm not spammed with error emails from that 1 issue. It's better, I'll give you that, but still not perfect. Maybe with next year's update? :)
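A minimal sketch of raising the threshold in code, assuming HtmlUnit's log output ends up in java.util.logging (which is what Commons Logging typically falls back to when no other backend is configured). This is an assumption about the setup: with log4j the equivalent would be a logger-level setting in the log4j configuration instead. Note that java.util.logging has no separate FATAL level (ERROR and FATAL both arrive as SEVERE), so suppressing ERROR there means using `Level.OFF`. The class name `HtmlUnitLogging` and the method `setThreshold` are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; assumes java.util.logging is the active backend.
final class HtmlUnitLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an unreferenced logger can be lost.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    private HtmlUnitLogging() {}

    /** Sets the minimum level that loggers under the HtmlUnit package will emit. */
    static Logger setThreshold(Level threshold) {
        HTMLUNIT_LOGGER.setLevel(threshold);
        return HTMLUNIT_LOGGER;
    }
}
```

Calling `HtmlUnitLogging.setThreshold(Level.OFF)` once at startup, before creating the `WebClient`, would silence the remaining script error under these assumptions.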