HTTPBuilder - How can I get the HTML content of a web page?

Question

I need to extract the HTML of a web page. I'm using HTTPBuilder in Groovy and making the following GET request:

import groovyx.net.http.ContentType
import groovyx.net.http.HTTPBuilder
import groovyx.net.http.Method

def http = new HTTPBuilder('http://www.google.com/search')
http.request(Method.GET) {
    requestContentType = ContentType.HTML
    response.success = { resp, reader ->
        println "resp: " + resp
        println "READER: " + reader
    }
    response.failure = { resp, reader ->
        println "Failure"
    }
}

The response I get does not contain the same HTML I see when I view the source of www.google.com/search. In fact, it isn't HTML at all, and it doesn't contain the same information as the page source. I've tried setting different headers (for example, headers.Accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', headers.Accept = 'text/html', setting the User-Agent, etc.), but the result is the same. How can I get the HTML of www.google.com/search (or any web page) using HTTPBuilder?

score 0 · Answer 1 · answered Jan 16 '13 at 05:07

HTTPBuilder automatically parses the response according to its content type. To get the raw HTML, read the text directly from the response entity:

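import static groovyx.net.http.ContentType.TEXT

// A one-argument closure receives the response before its body has been parsed
def htmlResult = http.get(uri: url, contentType: TEXT) { resp ->
    resp.entity.content.text
}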

Rick Li

score 0 · Answer 2 · answered Aug 22 '11 at 08:11

Why use HTTPBuilder? You could instead use

def url = "http://www.google.com/".toURL()
println url.text

to extract the content of the web page.
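If you also need to send headers such as User-Agent with this approach, one option is the Map variant of getText on URL. The following is only a minimal sketch, assuming a reasonably recent Groovy where the requestProperties key is supported. fetchHtml is a hypothetical helper name, and the throwaway local server and its sample page exist only so the snippet runs without network access:

```groovy
import com.sun.net.httpserver.HttpHandler
import com.sun.net.httpserver.HttpServer

// Hypothetical helper: returns the response body as an unparsed String.
String fetchHtml(URL url) {
    url.getText(connectTimeout: 5000, readTimeout: 5000,
                requestProperties: ['User-Agent': 'Mozilla/5.0', 'Accept': 'text/html'])
}

// Throwaway local server standing in for a real site (illustration only).
String page = '<html><body><h1>Hello</h1></body></html>'
HttpServer server = HttpServer.create(new InetSocketAddress('127.0.0.1', 0), 0)
server.createContext('/', { exchange ->
    byte[] body = page.getBytes('UTF-8')
    exchange.responseHeaders.add('Content-Type', 'text/html; charset=UTF-8')
    exchange.sendResponseHeaders(200, body.length)
    exchange.responseBody.withStream { it.write(body) }
} as HttpHandler)
server.start()

String raw
try {
    // Port 0 above lets the OS pick a free port; read it back from the server.
    raw = fetchHtml(new URL("http://127.0.0.1:${server.address.port}/"))
} finally {
    server.stop(0)
}
println raw
```

Against a real site you would pass its URL to fetchHtml instead. Keep in mind that pages built by JavaScript in the browser may still differ from what a plain HTTP request returns.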

Vamsi Emani
