I am currently writing a web scraper based on HtmlUnit to harvest specific company names and details from the Hannovermesse trade fair exhibition website. I have hit a showstopper: I cannot get the page-forward button on the search results page to work.

The entry website is
www.hannovermesse.de/en/exhibition/exhibitors-products/advanced-search/
Then a few search filters are set via checkboxes (EU Region, Industrial Automation/Robotics).

After submitting the form and loading the search results I get about 400 hits. When I select the Exhibitors tab, I receive the first results page. The search results are displayed on
//www.hannovermesse.de/en/exhibition/exhibitors-products/search
NOTE: You need to run the whole sequence to get to the results screen! It seems to use session/cookie data to determine what to display and by default it displays nothing.

This gives me 20 hits on the first page and displays the page selector at the bottom, with page 1 selected:
"[<][1] 2 ... | n [>]"
In order to harvest all contacts I need to click through all the screens indicated in the search results.

So the idea was to use the right-hand button to iterate through the pages, harvest the company details on each page as I go along, and terminate the loop when the right button is no longer active. I located the right button by various means such as getXPath, verified it, and even modified it by adding a name attribute so that I could find it with the usual HtmlAnchor lookup (getAnchorByName).

The result is invariably a runtime error and abort.

The log messages are:

Mai 01, 2016 6:05:11 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded WARNING: Script is not JavaScript (type: text/html, language: ). Skipping execution.
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[http://www.hannovermesse.de/files/001-fs5/media/layout/js/dmag.min.js] line=[2] lineSource=[null] lineOffset=[0]
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '[id='sizzle-1462118712173'] :selected' error: Invalid selector: [id="sizzle-1462118712173"] :selected).] sourceName=[http://www.hannovermesse.de/files/001-fs5/media/layout/js/dmag.min.js] line=[2] lineSource=[null] lineOffset=[0]
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'text/javascript'.
Mai 01, 2016 6:05:17 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'.

I have tried various browser option settings, but no joy. I found that this "invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x)." error has cropped up occasionally with arachnid and other test browsers. There, a "webClient.waitForBackgroundJavaScriptStartingBefore(5000);" call fixed the issue. I tried that, but it hasn't worked for me.

I am enclosing my quick and dirty proof-of-concept Java program for your reference. I am using Eclipse Mars with Java JRE 1.8, JUnit 4 and the HtmlUnit 2.22 libraries.

Does anyone have any idea what is going on, or what to change to make it work? I can't believe I am the first one to stumble over this!

My Java code:

/*---------------------------------------------------------------------------------*/
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlCheckBoxInput;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class App {
    static WebClient webClient;

    static String[] countries = {
                "European Union"
                                    };

    static String[] categories = {
                "Robotics"          
    };

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {

        setUp();

        HtmlPage currentPage = webClient.getPage("http://www.hannovermesse.de/en/exhibition/exhibitors-products/advanced-search/");
        System.out.println(currentPage.getTitleText()+"Web page open\n------------------------------------------------------------------\n");

        registerCountries(currentPage);
        registerCategories(currentPage);
        System.out.println("Search filters registered\n------------------------------------------------------------------\n");

        currentPage = submitSearchRequest(currentPage);
        System.out.println("Search filters submitted and results loaded\n------------------------------------------------------------------\n");

        selectExhibitorView(currentPage);
        System.out.println("Exhibitor View selected\n------------------------------------------------------------------\n");

        showCriteria(currentPage);

        showResultsCount(currentPage);

        HtmlPage backupPage = currentPage;

        for(int n=0, tn=0; n<1; n++){
            System.out.println("========================================================================================");
            // Parenthesize n+1 so it is added as a number rather than concatenated as "0" + "1"
            System.out.println(" Results page "+(n+1));

            HtmlAnchor nextPageButton = (HtmlAnchor) currentPage.getFirstByXPath(".//div[@class=\"col s-col12 m-col12 l-col12\"]/ul/following-sibling::a");
            String classValue = nextPageButton.getAttribute("class");
            nextPageButton.setAttribute("name", nextPageButton.getAttribute("class").trim());

            NamedNodeMap attribList = nextPageButton.getAttributes();
            for (int i=0; i < attribList.getLength(); i++) {
                Node node = attribList.item(i);
                String key=node.getLocalName();
                String val=node.getNodeValue();             
                System.out.printf("[%-15s] : '%s'\n", key, val);
            }

            List <HtmlElement> elementList = (List<HtmlElement>)currentPage.getByXPath(".//h4[@itemprop=\"name\"]/text()");         
            int i=0;
            for(; i<elementList.size();i++){
                System.out.printf("[%3d] '%s'\n", +(tn+i), elementList.get(i));
            }
            tn=i;

            System.out.println("Next Button :");
            final HtmlAnchor newPageLink  = (HtmlAnchor) currentPage.getAnchorByName(classValue.trim());
            currentPage = (HtmlPage) newPageLink.click();
            currentPage = nextPageButton.click();
            System.out.println("===========>[13]");

        }
        currentPage = backupPage;          

        System.out.println("Done");
        webClient.close();
    }

    private static void showResultsCount(HtmlPage currentPage) {
        String results = "";
        int count;
        results = (String) currentPage.getByXPath("String("+".//div[@class=\"col l-col8 m-col7 s-col12\"]/p[@class=\"query-text\"]/text()"+")").get(0);
        publish("Raw results : "+results);
        count= Integer.parseInt(results.split(" ")[0]);
        publish("Results : "+count+" found.\n");
    }

    private static void selectExhibitorView(HtmlPage currentPage) {
        HtmlSelect select = (HtmlSelect) currentPage.getElementById("searchResult:resultType");
        HtmlOption option = select.getOptionByValue("1");
        select.setSelectedAttribute(option, true);      
    }

    private static HtmlPage submitSearchRequest(HtmlPage currentPage) {
        try {       
            final HtmlForm form  = (HtmlForm) currentPage.getFormByName("searchAP:search");
            final HtmlSubmitInput button = form.getInputByName("searchAP:searchButton2");           
            currentPage = (HtmlPage) button.click();
            System.out.println(currentPage.getTitleText());
        } catch (Exception e) {
            System.out.println("===> Cannot submit Search Form, no submit button found!");
        }
        return currentPage; 
    }

    private static void showCriteria(HtmlPage currentPage) {
        publish("Filtercriteria for this search:");
        String results = "";
        results = (String) currentPage.getByXPath("String(.//h1[contains(text(), \"Search Result\")]/following-sibling::p)").get(0);

        String[] criteria= results.split(",");
        String key = "";
        Map<String, ArrayList<String>> cMap = new LinkedHashMap<String, ArrayList<String>>();
        ArrayList<String> value = new ArrayList<String>();
        cMap.put(key, value);

        for(int i=0; i<criteria.length; i++){
            if(criteria[i].contains(":")){
                String workCopy = new String(criteria[i]);
                String[] bits= workCopy.split(":");
                key = bits[0].trim();
                criteria[i]=bits[1].trim();
                value = new ArrayList<String>();
                cMap.put(key, value);
            }
            value.add(criteria[i].trim());
        }  

        for (Map.Entry<String, ArrayList<String>> entry : cMap.entrySet()) {
            key = entry.getKey();
            value = entry.getValue();
            if(!value.isEmpty()){
                System.out.println(key+": ");
                for (int i = 0; i < value.size(); i++) {
                    System.out.println("  "+value.get(i));
                }
            }
        }
    }

    public static void publish(String text) {
        System.out.println(text);       
    }
    public static void registerCountries(HtmlPage currentPage) {
        for(int i=0;i < countries.length; i++){
            setCountryCheckbox(currentPage, countries[i]);
        }
    }

    public static void registerCategories(HtmlPage currentPage) {
        for(int i=0;i < categories.length; i++){
            setCategoryCheckbox(currentPage, categories[i]);
        }       
    }

    public static void setCountryCheckbox(HtmlPage currentPage, String text) {
        String label="";
        HtmlCheckBoxInput input;

        try {
            label = (String) currentPage.getByXPath("String(.//label[contains(text(), \""+text+"\")]/@for)").get(0);
            System.out.print(text);
            input = currentPage.getHtmlElementById(label);
            input.setChecked(true);
            System.out.println(": "+(input.isChecked()?"SET":""));
        } catch (Exception e) {
            System.out.println("\rError: Label ID for '"+text+"' not found. ");
        }
    }

    public static void setCategoryCheckbox(HtmlPage currentPage, String text) {
        String label="";
        HtmlCheckBoxInput input;
        String XPathXpression = ".//strong[contains(text(), \""+text+"\")]/parent::div/input/@id";

        try {
            label = (String) currentPage.getByXPath("String("+XPathXpression+")").get(0);
            System.out.print(text+" : "+"'"+label+"' ");
            input = currentPage.getHtmlElementById(label);
            input.setChecked(true);
            System.out.println(": "+(input.isChecked()?"SET":""));
        } catch (Exception e) {
            System.out.println("\rError: Label ID for '"+text+"' not found. ");
        }
    }

    public static void setUp() throws InterruptedException {
          webClient = new WebClient(BrowserVersion.FIREFOX_45);
          WebClientOptions options = webClient.getOptions();
          options.setPrintContentOnFailingStatusCode(true);
          options.setJavaScriptEnabled(true);
          options.setThrowExceptionOnScriptError(false);
          options.setThrowExceptionOnFailingStatusCode(false);
          webClient.waitForBackgroundJavaScriptStartingBefore(5000);          
      }
}
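
The paging idea described in the question (harvest each page, then follow the "next" button until it is no longer active) can be sketched independently of HtmlUnit. In this minimal sketch, `ResultPage`, `PagedHarvest`, `harvestAll` and `fake` are hypothetical names that do not exist in HtmlUnit or in the program above; in a real scraper, `next()` would presumably wrap the click on the next-page anchor and return an empty `Optional` once the button is inactive. The `maxPages` cap is an added safety assumption so that a button that never deactivates cannot cause an endless loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for one results page; not an HtmlUnit type. */
interface ResultPage {
    /** Company names found on this page. */
    List<String> companyNames();

    /** The following page, or empty when the "next" button is inactive. */
    Optional<ResultPage> next();
}

final class PagedHarvest {

    /** Collects names from every page, stopping at the last page or after maxPages. */
    static List<String> harvestAll(ResultPage first, int maxPages) {
        List<String> all = new ArrayList<>();
        Optional<ResultPage> current = Optional.ofNullable(first);
        for (int n = 0; n < maxPages && current.isPresent(); n++) {
            ResultPage page = current.get();
            all.addAll(page.companyNames());
            current = page.next();
        }
        return all;
    }

    /** In-memory fake used only to exercise the loop without a network connection. */
    static ResultPage fake(final List<List<String>> pages, final int index) {
        return new ResultPage() {
            public List<String> companyNames() {
                return pages.get(index);
            }

            public Optional<ResultPage> next() {
                return index + 1 < pages.size()
                        ? Optional.of(fake(pages, index + 1))
                        : Optional.<ResultPage>empty();
            }
        };
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("Company A", "Company B"),
                Arrays.asList("Company C"));
        System.out.println(harvestAll(fake(pages, 0), 50));
    }
}
```

Keeping the loop separate from the page access makes it possible to test the termination condition without running the whole search sequence against the live site.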
This problem seems to be a quirk of the Mozilla Rhino JavaScript engine used in HtmlUnit, which is extremely picky about acceptable syntax. I ended up chopping off the offending 'return false' statement at the end of the onclick handler, and it executed. However, the script still didn't deliver the goods, as the website then kept disconnecting my session. I guess it detected the scraping attempt, so in the end I had to give up. – Helmut Jun 08 '16 at 22:24
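
The workaround mentioned in this comment, removing a trailing 'return false' from the onclick value, could look roughly like the sketch below. The class name `OnclickCleanup`, the method `stripTrailingReturnFalse` and the sample handler `goToPage(2)` are hypothetical and chosen only for illustration; the actual handler text on the site is not shown in the question. With HtmlUnit, the cleaned string would presumably be written back with `setAttribute("onclick", ...)` before clicking, which is an assumption and not something the comment confirms.

```java
import java.util.regex.Pattern;

final class OnclickCleanup {

    // Matches an optional run of whitespace, then "return false" with an
    // optional semicolon, anchored at the end of the handler text.
    private static final Pattern TRAILING_RETURN_FALSE =
            Pattern.compile("\\s*\\breturn\\s+false\\s*;?\\s*$");

    /** Returns the handler text without a trailing "return false" statement. */
    static String stripTrailingReturnFalse(String onclick) {
        if (onclick == null) {
            return "";
        }
        return TRAILING_RETURN_FALSE.matcher(onclick).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        // Hypothetical handler text, for illustration only.
        System.out.println(stripTrailingReturnFalse("goToPage(2); return false;"));
    }
}
```

Handlers that do not end in 'return false' are returned unchanged apart from trimming, so the helper can be applied to every pager link without checking first.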

2 Answers

0

If you use HtmlSubmitInput for a button, HtmlUnit tries to find an input element instead of a button element.

Use HtmlButton instead of HtmlSubmitInput.

Here is an example.

HtmlButton button = form.getButtonByName("submitButton");

0

Just two hints:

  1. "An invalid or illegal selector was specified..." is a really common log output when testing web applications that use jQuery with HtmlUnit. jQuery makes some calls to check which CSS selector capabilities the browser supports. Because HtmlUnit logs exceptions at the moment of construction, you will see this log output. The exceptions are handled by the (jQuery) JavaScript code later on. Usually you can simply ignore it.

  2. webClient.waitForBackgroundJavaScriptStartingBefore(5000); is not an option setting. This call does NOT set any wait timeout. You have to place this call inside your normal application flow, usually after triggering some action. This might be required if you trigger Ajax actions.
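
As a rough illustration of the second hint, the wait call can be placed directly after the action that triggers the JavaScript rather than in the setup code. This is a minimal sketch, assuming HtmlUnit 2.22 on the classpath; `ClickThenWait` and `clickAndWait` are hypothetical names, and whether this is enough for the paging link on this particular site is not confirmed by the answer.

```java
import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

final class ClickThenWait {

    /** Clicks the link, then blocks until background JavaScript has had time to run. */
    static HtmlPage clickAndWait(WebClient webClient, HtmlAnchor link, long timeoutMillis)
            throws IOException {
        // 1. Trigger the action first.
        link.click();

        // 2. Only now wait: this is a blocking call, not a configuration setting.
        //    It returns the number of background jobs still pending afterwards.
        int pendingJobs = webClient.waitForBackgroundJavaScriptStartingBefore(timeoutMillis);
        if (pendingJobs > 0) {
            System.err.println(pendingJobs + " background JavaScript job(s) still pending");
        }

        // 3. Re-read the page from the window, since JavaScript may have replaced it.
        //    The cast assumes the result is still an HTML page.
        return (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
    }
}
```

In the program from the question this would replace the call in setUp(), for example as `currentPage = ClickThenWait.clickAndWait(webClient, nextPageButton, 5000);` inside the paging loop.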

RBRi