
I'm trying to retrieve this page using Apache HttpClient: http://quick-dish.tablespoon.com/

Unfortunately, all I get back is the following (this is JSoup's output, so the input it received was probably just the HTTP status line and headers as a string):

<html>
 <head></head>
 <body>
  HTTP/1.1 200 OK [Server: nginx/1.0.11, Content-Type: text/html;charset=UTF-8, Last-Modified: Mon, 02 Jul 2012 15:30:40 GMT, Vary: Accept-Encoding, Cookie,Accept-Encoding, X-Powered-By: PHP/5.3.6, X-Pingback: http://quick-dish.tablespoon.com/xmlrpc.php, X-Powered-By: ASP.NET, Content-Encoding: gzip, X-Blz: lb1.blaze.io, Date: Mon, 02 Jul 2012 16:06:21 GMT, Content-Length: 11723, Connection: keep-alive]
 </body>
</html>

Here is my code (note that I'm emulating Googlebot, as I've found that web servers tend to be better behaved that way):

URL sourceURL = new URL("http://quick-dish.tablespoon.com/");
HttpClient httpClient = new ContentEncodingHttpClient();
httpClient.getParams().setBooleanParameter("http.protocol.handle-redirects", true);

final HttpGet httpget = new HttpGet(sourceURL.toURI());
httpget.setHeader("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
httpget.setHeader("Accept", "text/html");
httpget.setHeader("Accept-Charset", "utf-8");

final HttpResponse response = httpClient.execute(httpget);
return Jsoup.parse(response.toString());

Needless to say, the page returns fine in my web browser. Any ideas?

sanity

2 Answers


response.toString() only prints the status line and headers, which is exactly what you are seeing. Instead of calling toString(), get the response entity:

// Get hold of the response entity
 HttpEntity entity = response.getEntity();

Then read the body from that entity, for example with EntityUtils.toString(entity), and pass the resulting string to Jsoup.parse.
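If you would rather read the entity's stream yourself, a minimal sketch follows. It uses only the JDK so that it runs without a network call: `readBody` and `EntityBodyReader` are hypothetical names, the in-memory byte array stands in for a real response, and UTF-8 is assumed because that is the charset the server reports in the question. In the question's code the stream would come from `response.getEntity().getContent()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EntityBodyReader {

    // Hypothetical helper: drains a stream and decodes it with the given charset.
    // With HttpClient, the stream would be response.getEntity().getContent().
    static String readBody(InputStream content, Charset charset) throws IOException {
        StringBuilder body = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (Reader reader = new InputStreamReader(content, charset)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                body.append(buffer, 0, read);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real response body (assumption: the page is UTF-8 encoded)
        byte[] fakeBody = "<html><body>Quick Dish</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        String html = readBody(new ByteArrayInputStream(fakeBody), StandardCharsets.UTF_8);
        System.out.println(html);
    }
}
```

The returned string is what should be handed to Jsoup.parse, not the result of response.toString(). EntityUtils.toString(entity) should do the same job in a single call, so the manual loop is only worth it if you need control over how the stream is read.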

kapa
Shaun Hare
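// Read the response body (not the status line and headers) into a string
HttpEntity entity = response.getEntity();
String pageHTML = EntityUtils.toString(entity);
// Parse the body text rather than response.toString()
return Jsoup.parse(pageHTML);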
chloe