
First I should say that I don't know JavaScript well at all. I'm trying to simulate a click on a hyperlink on a Bloomberg search results page. I want to grab a list of news items (hyperlinks), then traverse the list, getting each article's title and text. This is my code:

public List<String> getBloomNewsHtmlUnit() throws IOException {
    String searchString = "Apple";
    List<String> bloombergNewsAll = new ArrayList<>();

    WebClient webclient = new WebClient(BrowserVersion.BEST_SUPPORTED);

    HtmlPage mainpage = webclient.getPage("http://www.bloomberg.com/search?query=" + searchString);

    HtmlAnchor pageanchor = mainpage.getFirstByXPath("//*[@id=\"content\"]/div/section/section[2]/section[1]/div[2]/div[2]/article/div[1]/h1/a");

    webclient.waitForBackgroundJavaScript(50000);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());

    mainpage = pageanchor.click();

    System.out.println("Main page: " + mainpage.asText());

    return bloombergNewsAll;
}

This is the exception:

Sep 11, 2016 9:49:34 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js] line=[153] lineSource=[null] lineOffset=[0]
Exception in thread "main" java.lang.RuntimeException: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:284)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:519)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:386)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:304)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:451)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:436)
at com.jsoup.test.BloombergTest.getBloomNewsHtmlUnit(BloombergTest.java:71)
at com.jsoup.test.BloombergTest.main(BloombergTest.java:37)
Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:803)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:779)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:975)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:352)
at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:238)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:277)
... 7 more
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3915)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3899)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3924)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3940)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefCallError(ScriptRuntime.java:3956)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2390)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2384)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1342)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:794)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:906)
... 15 more
Java Result: 1

Even if I only execute the first four lines of my code (without any reference to the HtmlAnchor), the same error comes up. I read a few bug reports about this error online, but none of the suggested solutions seem to work in my case:

htmlunit : An invalid or illegal selector was specified

In the Stack Overflow question above, I applied the suggested waitForBackgroundJavaScript to the WebClient, but this did not solve the problem.
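Concretely, I placed the call roughly like this, after fetching the search page and before the click:

HtmlPage mainpage = webclient.getPage("http://www.bloomberg.com/search?query=" + searchString);
// wait up to 50 seconds for background JavaScript to finish before interacting with the page
webclient.waitForBackgroundJavaScript(50000);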

JavaScript Exception in HtmlUnit when clicking at google result page

Based on this question, I tried adding:

JavaScriptEngine engine = webclient.getJavaScriptEngine();
engine.holdPosponedActions();

to the code, but the error was still there.

https://sourceforge.net/p/htmlunit/bugs/1744/

In the above bug report, the suggested solution was to redefine the main page with the result of the select query. In my case I tried redefining the page with a click() event, but my code doesn't get that far; it throws the error as soon as I try to define the HtmlPage.

https://sourceforge.net/p/htmlunit/bugs/1661/

This report suggests simply ignoring the warnings, but in my case I'm getting an exception (not just warnings), which prevents the desired output.

I first tried to do this scraping using Jsoup. This worked fine, but Jsoup was giving some erroneous links in between the article text which were not on the original page when I inspected it in Chrome. I suspect that a JS or Ajax call changed the page DOM. This is why I chose to use HtmlUnit.

I would appreciate any tips on what I'm doing wrong to get this error and how to correct it. Also, if anybody thinks it is possible to achieve what I want with Jsoup alone, please let me know (I just read that Jsoup doesn't support dynamic changes to the DOM, so it won't work on its own). Thanks in advance!

jay tai
  • Not directly related to your problem, but why are you setting a `SilentCssErrorHandler`? Most likely you do not need css at all. So you could disable it: `webClient.getOptions().setCssEnabled(false);` – MrSmith42 Sep 12 '16 at 10:10
  • Are you sure your xpath is correct? Try to log the value of `pageanchor`, e.g. `System.err.println(pageanchor.asXml());`. – MrSmith42 Sep 12 '16 at 10:13
  • Thanks for the helpful tips MrSmith. I removed SilentCssErrorHandler. I don't seem to be able to log pageanchor. The exception happens on the getPage statement, before the pageanchor statement. The application does not output anything except the stack trace. In fact, if I remove all of the lines and just try getPage I get the exact same exception. Does this suggest some JS library conflict between HtmlUnit and the page? In that case won't the exception always happen whether or not the pageanchor xpath is correct? – jay tai Sep 12 '16 at 10:33
  • The xpath is correct (for the second headline). Yes, the HtmlUnit engine is limited (Rhino is also rather slow), so it is a conflict between HtmlUnit and the JS used in the page. My approach usually is: open the page in a browser with JS disabled. If all needed content is there, I use jsoup; otherwise I try it with HtmlUnit. If HtmlUnit fails, I use PhantomJS, though it is not plain Java. – Frederic Klein Sep 12 '16 at 10:41

1 Answer


The exception doesn't necessarily mean that the resulting page is useless, though it might be in other cases. You have to check the result for the content you are looking for.

To reduce the output of error messages from the JavaScript engine, you can define:

java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

The following example selects the first headline, triggers the click event and grabs the resulting page; to verify that we followed the link, the title is printed out:

java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

final WebClient webClient = new WebClient(BrowserVersion.CHROME);

// CSS is not needed for scraping; don't let script or status-code errors abort the run
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setTimeout(10000);

try {
    HtmlPage page = webClient.getPage("http://www.bloomberg.com/search?query=Apple");

    System.out.println(page.getTitleText());

    // click the first headline via its CSS selector and follow the link
    ScriptResult result = page.executeJavaScript("document.querySelector(\"#content > div > section > section.search-results__content > section.content-stories > div.search-result-items > div:nth-child(1) > article > div > h1 > a\").click()");

    page = (HtmlPage) result.getNewPage();

    System.out.println(page.getTitleText());

} catch (Exception e) {
    e.printStackTrace();
} finally {
    webClient.close();
}
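If you would rather stay in the HtmlUnit API instead of going through executeJavaScript, a rough (untested) alternative is to query the same selector for the anchor element and click it directly; this sketch would replace the executeJavaScript call inside the try block above:

// locate the first headline anchor with the same CSS selector and click it directly
DomNode node = page.querySelector("#content > div > section > section.search-results__content > section.content-stories > div.search-result-items > div:nth-child(1) > article > div > h1 > a");
if (node instanceof HtmlAnchor) {
    page = ((HtmlAnchor) node).click();
    System.out.println(page.getTitleText());
}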

Since the pages are not populated using JavaScript, you could also skip HtmlUnit altogether and use an HTML parser like jsoup:

News class

class News{
    private String title;
    private String href;
    private String content="";

    public String getTitle() {
        return title;
    }

    public String getHref() {
        return href;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public News(String title, String href){
        this.title=title;
        this.href=href;
    }
}

Example code for grabbing news from the first two pages (adjustable through numberOfResultpages):

List<News> bloombergNewsAll = new ArrayList<>();

String searchString = "Apple";
String searchUrl = "http://www.bloomberg.com/search?query=" + searchString + "&page=";
int numberOfResultpages = 2;
Document doc;

// grab title and href
for (int i = 1; i <= numberOfResultpages; i++) {
    try {
        doc = Jsoup.connect(searchUrl + i)
                .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                .referrer("http://www.bloomberg.com/").get();
        Elements searchResults = doc.select("#content > div > section > section.search-results__content > section.content-stories > div.search-result-items > div > article > div > h1");
        if(searchResults.isEmpty()) break; // no more searchResults

        for (Element result : searchResults) {
            bloombergNewsAll.add(new News(result.text(), result.select("a").attr("href")));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// grab content
for (News news : bloombergNewsAll) {

    // video pages have no article body, so skip them before fetching
    if (news.getHref().contains("bloomberg.com/news/videos")) continue;

    try {
        doc = Jsoup.connect(news.getHref())
                .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                .referrer("http://www.bloomberg.com/search?query=Apple").get();

        // the article body lives under a different selector per Bloomberg section
        if (news.getHref().contains("bloomberg.com/news/")) {
            news.setContent(doc.select("div.article-body__content").text());
        } else if (news.getHref().contains("bloomberg.com/gadfly")) {
            news.setContent(doc.select("#article > div.body_ZtDFu > div.container_1KxJx").text());
        } else if (news.getHref().contains("bloomberg.com/view")) {
            news.setContent(doc.select("div._31WvjDF17ltgFb1fNB1WqY").text());
        }

    } catch (IOException e) {
        e.printStackTrace();
    }
}

// do something useful with your results
for (News news : bloombergNewsAll) {
    System.out.println(news.getTitle() + "\n\t" + news.getHref() + "\n\t" + (news.getContent().length()>150 ? news.getContent().substring(0, 150) : news.getContent()));
}
Frederic Klein
  • Thanks for this very useful answer. Before I try it I would like to understand the logic behind this answer. Does HtmlUnit use JavaScript to execute page simulations as a general rule, and is it recommended to know JS well to use HtmlUnit? How were you able to tell that JS is not being used in the page by inspecting the browser console? Why use numberOfResultpages in the Jsoup example if, say, you want to read the results in a Java console or using MVC (i.e. JSF or Spring)? – jay tai Sep 12 '16 at 12:47
  • HtmlUnit is a "GUI-Less browser for Java programs", so JavaScript events like click() are probably done using the embedded JS engine. As I mentioned, I usually try to load a page in a normal browser with JavaScript disabled (e.g. using uMatrix) and inspect the result. Using executeJavaScript is useful, since you can test the JavaScript code in the developer tools of a browser before using it in your code. numberOfResultpages in my code is just a limit for how many result pages of the search results should be crawled; the number depends on your use case, how often you would crawl, etc. – Frederic Klein Sep 12 '16 at 12:57
  • "Why use numberOfResult pages in the Jsoup example if, say you want to read the results in a Java console or using MVC (ie: JSF or Spring)?" I don't understand that question. Could you elaborate? – Frederic Klein Sep 12 '16 at 13:00
  • Yes. Sorry. What does numberOfResultPages actually refer to? Is it the number of pages you want to get back from the search or the number of result pages you think that href click might produce? What decides this integer? Is it an arbitrary guess or is there a specific reason why 2 result pages have been included in your code? – jay tai Sep 12 '16 at 15:12
  • The 2 is just to limit the output; we are talking about hundreds of pages with search results here. If you would like to parse until there are no more search results, you could replace the fixed page count with an open-ended loop like the one sketched after this comment (though you might want to process the results in batches then, otherwise you might run into memory problems, etc.). – Frederic Klein Sep 12 '16 at 15:54
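A sketch of that open-ended loop, reusing searchUrl, the user agent, the result selector, and the bloombergNewsAll list from the answer above; here a network error simply stops the loop:

// grab title and href until a result page comes back empty
int i = 1;
while (true) {
    Elements searchResults;
    try {
        Document doc = Jsoup.connect(searchUrl + i)
                .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                .referrer("http://www.bloomberg.com/").get();
        searchResults = doc.select("#content > div > section > section.search-results__content > section.content-stories > div.search-result-items > div > article > div > h1");
    } catch (IOException e) {
        e.printStackTrace();
        break; // stop on network errors instead of looping forever
    }
    if (searchResults.isEmpty()) break; // no more search results
    i++;
    for (Element result : searchResults) {
        bloombergNewsAll.add(new News(result.text(), result.select("a").attr("href")));
    }
}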