I'm programming a generic webcrawler that gets the main content from a given webpage (it has to crawl different pages).
I've tried to achieve this with different tools, among them:
- HtmlUnit: returned me too much scrap when crawling.
- Essence: failed to get the important information on many pages.
- Boilerpipe: retrieves the content successfully, almost perfect results but:
When I try to crawl pages like TripAdvisor instead of the given webpage html it returns the following message:
We noticed that you're using an unsupported browser. The Tripadvisor website may not display properly.We support the following browsers: Windows: Internet Explorer, Mozilla Firefox, Google Chrome. Mac: Safari.
I am using user agent:
private final static String USER_AGENT = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html)
I've also tried to use different user agents, even mobile ones but I always get the same error, is it related to Javascript maybe?
My code is the following, if needed:
public void getPageName(String urlString) throws Exception {
try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
boolean javascriptEnabled = true;
webClient.setRefreshHandler(new WaitingRefreshHandler(TIMEOUT / 1000));
webClient.setCssErrorHandler(new SilentCssErrorHandler());
webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());
webClient.getCache().setMaxSize(0);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setJavaScriptEnabled(javascriptEnabled);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setTimeout(TIMEOUT);
//Boilerpipe // NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
URL url = new URL(urlString);
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = DefaultExtractor.INSTANCE.getText(is);
System.out.println("\n******************\n");
System.out.println(text);
System.out.println("\n******************\n");
writeIntoFile(text);
}
catch (Exception e){
System.out.println("Error when reading page " + e);
}
}