I'm writing a generic web crawler that extracts the main content from a given web page (it has to handle many different sites).

I've tried to achieve this with different tools, among them:

  • HtmlUnit: returned too much junk when crawling.
  • Essence: failed to get the important information on many pages.
  • Boilerpipe: retrieves the content successfully with almost perfect results, except in the case described below.

When I crawl pages like TripAdvisor, instead of the page's HTML I get the following message:

We noticed that you're using an unsupported browser. The Tripadvisor website may not display properly.We support the following browsers: Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac: Safari.

I am using this user agent: private final static String USER_AGENT = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

I've also tried different user agents, including mobile ones, but I always get the same message. Could it be related to JavaScript?

My code is the following, if needed:

public void getPageName(String urlString) throws Exception {
        try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
            boolean javascriptEnabled = true;

            webClient.setRefreshHandler(new WaitingRefreshHandler(TIMEOUT / 1000));
            webClient.setCssErrorHandler(new SilentCssErrorHandler());
            webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
            webClient.getCache().setMaxSize(0);

            webClient.getOptions().setRedirectEnabled(true);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setJavaScriptEnabled(javascriptEnabled);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setTimeout(TIMEOUT);

            // Boilerpipe. NOTE: use ArticleExtractor unless DefaultExtractor gives better results for you
            URL url = new URL(urlString);
            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());
            String text = DefaultExtractor.INSTANCE.getText(is);


            System.out.println("\n******************\n");
            System.out.println(text);
            System.out.println("\n******************\n");

            writeIntoFile(text);

        }
        catch (Exception e){
            System.out.println("Error when reading page  " + e);
        }
    }

1 Answer

We noticed that you're using an unsupported browser. The Tripadvisor website may not display properly.We support the following browsers: Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac: Safari.

Most websites require JavaScript, and this kind of message usually means that the client fetching the page does not support JavaScript.
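One detail in the posted method is consistent with this: the WebClient is configured but never used to load the page. Boilerpipe reads from url.openStream(), which is a plain HTTP fetch that runs no JavaScript and, as far as the posted code shows, never sends the USER_AGENT constant (by default the JDK identifies itself as Java/<version>). The sketch below, using only the JDK, shows how a header is attached to a plain connection. The class and method names are hypothetical, and setting the header alone may still not be enough if the site depends on JavaScript.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentCheck {
    // Same value as the USER_AGENT constant in the question.
    static final String USER_AGENT =
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

    // Hypothetical helper: prepares a connection that carries the custom header.
    // openConnection() does not contact the server; that happens on getInputStream().
    static URLConnection prepare(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URLConnection conn = prepare("https://www.tripadvisor.com/");
        // Prints the header that would be sent with the request.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

With this in place, the stream passed to Boilerpipe would come from conn.getInputStream() rather than url.openStream().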

You may want to give HtmlUnit a second try. If you have suggestions or bug reports for HtmlUnit, feel free to open issues on GitHub and I will try to help.
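A rough sketch of what that second try could look like: let HtmlUnit load the page and run its scripts, then hand the rendered markup to Boilerpipe instead of letting Boilerpipe fetch the URL itself. This assumes the HtmlUnit 2.x package names (com.gargoylesoftware.htmlunit; newer releases use org.htmlunit) and that the target URL returns HTML. The class name, method names, and the 10-second wait are hypothetical choices, not something taken from the question.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.DefaultExtractor;

import java.io.IOException;

public class RenderedPageExtractor {
    // Hypothetical budget for background scripts; tune per site.
    private static final long JS_WAIT_MILLIS = 10_000;

    // Loads the page with JavaScript enabled and returns the resulting DOM as markup.
    static String fetchRenderedHtml(String urlString) throws IOException {
        try (WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            // Assumes an HTML response; other content types would fail this assignment.
            HtmlPage page = webClient.getPage(urlString);
            // Give asynchronous scripts a chance to finish before reading the DOM.
            webClient.waitForBackgroundJavaScript(JS_WAIT_MILLIS);
            return page.asXml();
        }
    }

    // Runs Boilerpipe on markup that was already fetched, so no second request is made.
    static String extractMainText(String html) throws BoilerpipeProcessingException {
        return DefaultExtractor.INSTANCE.getText(html);
    }

    public static void main(String[] args) throws Exception {
        String html = fetchRenderedHtml(args[0]);
        System.out.println(extractMainText(html));
    }
}
```

The point of the split is that the WebClient options from the question (timeouts, error listeners, and so on) only take effect when the page is actually loaded through webClient.getPage(...).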

RBRi