When I try to load a page through HtmlUnit, I always get a 301 response, even though the exact same page loads fine in a browser.

The code giving me the error is

public String getPage(String url) {
    try {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setRedirectEnabled(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        // webClient.getOptions().setTimeout();

        final HtmlPage page = webClient.getPage(url);
        return page.asText();
    } catch (IOException ex) {
        Logger.getLogger(Worker.class.getName()).log(Level.SEVERE, null, ex);
    } catch (FailingHttpStatusCodeException ex) {
        Logger.getLogger(Worker.class.getName()).log(Level.SEVERE, null, ex);
    }
    return null;
}

where url is http://www.instagram.com/name (I also tried https, with the same result).

The error returned is

> Jul 20, 2015 1:52:20 PM com.gargoylesoftware.htmlunit.WebClient
> printContentIfNecessary INFO: statusCode=[301] contentType=[text/html]
> Jul 20, 2015 1:52:20 PM com.gargoylesoftware.htmlunit.WebClient
> printContentIfNecessary INFO: <html> <head><title>301 Moved
> Permanently</title></head> <body bgcolor="white"> <center><h1>301
> Moved Permanently</h1></center> <hr><center>nginx</center> </body>
> </html>

However, when I go to http://www.instagram.com/name in my browser, it loads fine. I've heard Jsoup may be useful for what I want to do (getting the text of a page), but I'm more familiar with HtmlUnit. If you have a fix for my code or an alternative method, I'd be happy to try it.

yanana
  • webClient.getOptions().setRedirectEnabled(false); <- this will be your problem. 301 is just a redirection. Enable redirections. – skandigraun Jul 20 '15 at 18:09
  • @ram thanks for the quick help! I'll try when I'm back on my computer and let you know if it solves the error – Username123 Jul 20 '15 at 19:07

1 Answer

I just checked: the 301 also happens in a normal browser, which simply follows it without showing it to you. A 301 is a "Moved Permanently" redirect, in this case to

https://instagram.com/name

You can enable redirect following in HtmlUnit's WebClient with

webClient.getOptions().setRedirectEnabled(true);
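
To make the effect of that switch concrete, here is a minimal, self-contained sketch. It does not use HtmlUnit: it assumes Java 11+ and uses the JDK's own java.net.http.HttpClient against a throwaway local server, so it runs without any network access. The class name RedirectDemo, the /old and /new paths, and the response bodies are all made up for illustration. The idea is the same as with HtmlUnit: with redirect following off, you get the 301 page itself; with it on, you get the page the 301 points to.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RedirectDemo {

    /** Starts a local test server: /old answers 301 pointing at /new, /new answers 200. */
    static HttpServer startServer() throws IOException {
        // Port 0 lets the OS pick any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/old", exchange -> {
            byte[] body = "Moved Permanently".getBytes(StandardCharsets.UTF_8);
            String target = "http://127.0.0.1:" + exchange.getLocalAddress().getPort() + "/new";
            exchange.getResponseHeaders().add("Location", target);
            exchange.sendResponseHeaders(301, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.createContext("/new", exchange -> {
            byte[] body = "profile page".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Returns "<status> <body>" for the URL, following redirects only when asked to. */
    static String fetch(String url, boolean followRedirects)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .followRedirects(followRedirects
                        ? HttpClient.Redirect.NORMAL   // follow, except https -> http
                        : HttpClient.Redirect.NEVER)   // hand back the 3xx response as-is
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + " " + response.body();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startServer();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/old";
            System.out.println(fetch(url, false)); // 301 Moved Permanently
            System.out.println(fetch(url, true));  // 200 profile page
        } finally {
            server.stop(0);
        }
    }
}
```

The first call mirrors what the question's code sees with setRedirectEnabled(false); the second mirrors what a browser does.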

About Jsoup:

If the page you are trying to parse is served as complete HTML and no important DOM elements are populated by AJAX, then Jsoup is indeed the better option. It is much faster than a Selenium instance, and I prefer it whenever possible. If you need more flexibility in fetching pages, look into Apache HttpClient, which I frequently use for the download step; in that setup I still use Jsoup, but only for parsing, not for actually getting the page off the net. If the job is simple and your network access is not hindered by proxies and the like, you may as well use Jsoup's own connections. Selenium is great for testing and for situations where you need to run client-side JavaScript. The price is high memory use and slowness.

luksch