I need to extract the HTML of a web page. I'm using HTTPBuilder in Groovy and making the following GET request:

def http = new HTTPBuilder('http://www.google.com/search')
http.request(Method.GET) {
 requestContentType = ContentType.HTML
 response.success = { resp, reader ->
  println "resp: " + resp
  println "READER: " + reader
 }
 response.failure = { resp, reader ->
  println "Failure"
 }
}

The response I get does not contain the same HTML I see when I view the page source of www.google.com/search. In fact, it is not HTML at all, and it does not contain the same information as the page source. I've tried setting different headers (for example, headers.Accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', headers.Accept = 'text/html', setting the User-Agent, etc.), but the result is the same. How can I get the HTML of www.google.com/search (or any web page) using HTTPBuilder?

Perception

2 Answers

HTTPBuilder automatically parses the response according to its content type. To get the raw HTML, read the text from the response entity:

def htmlResult = http.get(uri: url, contentType: TEXT){ resp->
    return resp.getEntity().getContent().getText()
}
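
The same idea can be applied to the request form used in the question. The following is a minimal sketch, not a definitive implementation: it assumes HTTPBuilder 0.7.1 is resolvable via @Grab, and the helper name fetchRawHtml, the query parameter, and the User-Agent value are illustrative choices that do not come from the original post. Passing TEXT as the content type makes HTTPBuilder hand the success handler a plain Reader instead of a parsed document.

```groovy
@Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.7.1')
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.ContentType.TEXT
import static groovyx.net.http.Method.GET

// Hypothetical helper (not part of HTTPBuilder): returns the unparsed body as a String.
String fetchRawHtml(String url, Map params = [:]) {
    def http = new HTTPBuilder(url)
    // request() returns whatever the response handler returns.
    http.request(GET, TEXT) { req ->
        // Only set a query string when parameters were supplied.
        if (params) {
            uri.query = params
        }
        // Assumed header value; some sites vary their output by User-Agent.
        headers.'User-Agent' = 'Mozilla/5.0'

        // With TEXT, the second argument is a Reader over the raw body.
        response.success = { resp, reader -> reader.text }

        response.failure = { resp ->
            throw new IOException("Request failed: ${resp.statusLine}")
        }
    }
}
```

A call such as fetchRawHtml('http://www.google.com/search', [q: 'groovy']) should then return the markup as one string, although the server may still send different HTML than a browser receives.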
Rick Li

Why use httpBuilder? You might instead use

def url = "http://www.google.com/".toURL()

println url.text

to extract the content of the webpage
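
If request headers or timeouts matter, the Groovy JDK also offers a getText variant on URL that takes a parameter map. A minimal sketch, assuming Groovy 1.8.1 or later; the helper name fetchPage, the timeout values, and the default User-Agent are illustrative assumptions rather than anything from the original answer:

```groovy
// Hypothetical helper: fetch a page as text using only the Groovy JDK.
String fetchPage(String address, String userAgent = 'Mozilla/5.0') {
    address.toURL().getText(
        connectTimeout: 5000,      // milliseconds, assumed value
        readTimeout: 10000,        // milliseconds, assumed value
        requestProperties: ['User-Agent': userAgent]
    )
}
```

Calling fetchPage('http://www.google.com/') would return the response body as a String without any extra dependency; the body is decoded with the platform default charset unless a charset argument is passed to getText.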

Vamsi Emani