First I should say that I don't know JavaScript well at all. I'm trying to simulate a click on a hyperlink on a Bloomberg search page. I want to grab a list of news items (hyperlinks), then traverse the list and get each article's title and text. This is my code:
public List<String> getBloomNewsHtmlUnit() throws IOException {
    String searchString = "Apple";
    List<String> bloombergNewsAll = new ArrayList<>();
    WebClient webclient = new WebClient(BrowserVersion.BEST_SUPPORTED);
    HtmlPage mainpage = webclient.getPage("http://www.bloomberg.com/search?query=" + searchString);
    HtmlAnchor pageanchor = mainpage.getFirstByXPath("//*[@id=\"content\"]/div/section/section[2]/section[1]/div[2]/div[2]/article/div[1]/h1/a");
    webclient.waitForBackgroundJavaScript(50000);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());
    mainpage = pageanchor.click();
    System.out.println("Main page: " + mainpage.asText());
    return bloombergNewsAll;
}
This is the exception:
Sep 11, 2016 9:49:34 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js] line=[153] lineSource=[null] lineOffset=[0]
Exception in thread "main" java.lang.RuntimeException: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:284)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:519)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:386)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:304)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:451)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:436)
at com.jsoup.test.BloombergTest.getBloomNewsHtmlUnit(BloombergTest.java:71)
at com.jsoup.test.BloombergTest.main(BloombergTest.java:37)
Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:803)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:779)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:975)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:352)
at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:238)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:277)
... 7 more
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3915)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3899)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3924)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3940)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefCallError(ScriptRuntime.java:3956)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2390)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2384)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1342)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:794)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:906)
... 15 more
Java Result: 1
Even if I run only the first 4 lines of my code (without any reference to the HtmlAnchor), the same error comes up. I have read a few questions and bug reports about this error online, but none of the suggested solutions works in my case:
htmlunit : An invalid or illegal selector was specified
Following the Stack Overflow question above, I applied the suggested waitForBackgroundJavaScript to the webclient, but this did not solve the problem.
JavaScript Exception in HtmlUnit when clicking at google result page
Following this question, I tried adding:
JavaScriptEngine engine = webclient.getJavaScriptEngine();
engine.holdPosponedActions();
to the code, but the error was still there.
https://sourceforge.net/p/htmlunit/bugs/1744/
In the above bug report, the suggested solution was to reassign the main page to the result of the select query. In my case I tried reassigning the page to the result of click(). However, my code doesn't get that far: it throws the error as soon as I call getPage to obtain the HtmlPage.
https://sourceforge.net/p/htmlunit/bugs/1661/
This report suggests simply ignoring the warnings, but in my case I'm getting an exception (not just warnings), which prevents the desired output.
I first tried to do the scraping with Jsoup. This worked, but Jsoup returned some erroneous links in the middle of the article text that were not on the original page when I inspected it in Chrome. I suspect that a JavaScript or Ajax call changed the page DOM. This is why I chose to use HtmlUnit.
I would appreciate any tips on what I'm doing wrong to get this error and how to correct it. Also, if anybody thinks it is possible to achieve what I want with Jsoup alone, please let me know (I just read that Jsoup doesn't support dynamic changes in the DOM, so it may not work on its own). Thanks in advance!