When I fetch a page with HTTPBuilder and parse it as shown below, the output file does not contain the full HTML that I see when I open the page in a browser and inspect it. For example, the <img> tags are missing from the generated file.

@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )

import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*

def http = new HTTPBuilder('http://www.google.com') 
def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader ->

    def p = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(reader.text) 
    new File("/Users/../Documents/temp.txt") << p              
}

I want to get the number of images on that HTML page by parsing it.

Opal
user1207289

1 Answer

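This happens because the object returned by XmlSlurper is a parsed tree, and when it is written to a file only its text content is output, without the tags. The <img> elements are still present in the parsed tree; they just have no text to write. After running the following script: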

@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )

import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*

def http = new HTTPBuilder('http://www.google.com') 
def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader ->

    def p = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(reader.text) 
    new File("lol") << p              
}

the file lol contains, for example, the following line:

IMDbMoreAllTitlesTV EpisodesNamesCompaniesKeywordsCharactersQuotesBiosPlotsMovies,

Part of the markup behind that line looks like this before parsing:

    <div class="quicksearch_dropdown_wrapper">
      <select name="s" id="quicksearch" class="quicksearch_dropdown navbarSprite"
              onchange="jumpMenu(this); suggestionsearch_dropdown_choice(this);">
        <option value="all" >All</option>
        <option value="tt" >Titles</option>
        <option value="ep" >TV Episodes</option>
        <option value="nm" >Names</option>
        <option value="co" >Companies</option>
        <option value="kw" >Keywords</option>
        <option value="ch" >Characters</option>
        <option value="qu" >Quotes</option>
        <option value="bi" >Bios</option>
        <option value="pl" >Plots</option>
      </select>
    </div>

If you'd like to see the tags, write the raw response text to the file instead of the parsed result:

@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' )

import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.Method.GET
import static groovyx.net.http.ContentType.JSON
import groovy.json.*

def http = new HTTPBuilder('http://www.google.com') 
def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader ->

    new File("lol") << reader.text
}
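
For the image count asked about in the question, the parsed tree can be traversed directly instead of being written to a file. The following is a minimal sketch, not a definitive implementation. It assumes Groovy 2.x, as in the scripts above (on Groovy 3 or later, `import groovy.xml.XmlSlurper` may be needed), and it uses a small well-formed sample string, made up for illustration, in place of the downloaded page so that it runs without network access. For a real page, the NekoHTML parser would be passed to XmlSlurper as in the scripts above.

```groovy
// Hypothetical sample markup standing in for reader.text from the HTTP response.
def markup = '''<html><body>
  <div><img src="a.png"/><p>Poster</p></div>
  <p>Some text <img src="b.png"/></p>
</body></html>'''

// For real-world HTML, use:
// new XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
def page = new XmlSlurper().parseText(markup)

// The text content of the tree contains no tags, which is what ends up in the file.
def textOnly = page.text()

// Walk every node depth-first and keep the img elements.
// NekoHTML upper-cases element names by default, so compare case-insensitively.
def images = page.depthFirst().findAll { it.name().equalsIgnoreCase('img') }
def imageCount = images.size()
def sources = images.collect { it.@src.text() }

println "images: ${imageCount}"
println "sources: ${sources}"
```

The case-insensitive comparison is there so the same closure works with both the plain XML parser and the NekoHTML parser.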
Opal
  • It didn't solve the issue. I am getting an `html` file. I am using this: `@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.7' ) import groovyx.net.http.HTTPBuilder import static groovyx.net.http.Method.GET import static groovyx.net.http.ContentType.JSON import groovy.json.* def http = new HTTPBuilder('http://www.google.com') def html = http.get(uri: 'http://www.imdb.com/title/tt2004420/', contentType: groovyx.net.http.ContentType.TEXT) { resp, reader -> new File("/Users/../Documents/file1") << reader.text }` – user1207289 Aug 28 '15 at 14:52
  • Actually, when I open the file with a different text editor, I can see the full HTML as in the code. Above, I was referring to a file with a bunch of hyperlinks. So I think it did work. – user1207289 Aug 28 '15 at 17:46