Java screen scraping with JTidy - Parsing HTML values

Question

I'm trying to scrape an IMDb page for data about web series. The problem is that once I convert the page to a DOM object, getting values out of it isn't as easy as it looks.

For instance, getElementsByTagName("h1") returns a single element, so I know exactly which value I'm getting (in this case, the name of the show). But the show's rating is buried in nested divs and is very hard to locate. So I tried calling getElementById with the id of the enclosing div to narrow the search.

But it returns null. What would be the easiest way to scrape such a page?

Here's a code snippet:

public final class IMDBExtractor {

private String imdbId;

public IMDBExtractor(String imdbId) {
    this.imdbId = imdbId;
}

public synchronized TvShow extractTvShow() throws IOException {
    TvShow show = new TvShow();

    //access imdb url
    URL url  = new URL("http://www.imdb.com/title/" + imdbId);
    URLConnection uc = url.openConnection();
    uc.addRequestProperty("User-Agent",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    uc.connect();

    //Tidy up HTML
    Tidy tidy = new Tidy();
    tidy.setXmlOut(true);
    tidy.setShowWarnings(false);
    Document doc = tidy.parseDOM(uc.getInputStream(), null);

    //Set show attributes
    show.setImdbId(imdbId);
    show.setTitle(extractSeriesName(doc));
    show.setRating(extractRating(doc));
    return show;
}

private String extractSeriesName(Document doc) throws IOException {
  return doc.getElementsByTagName("h1").item(0).getChildNodes().item(0).getNodeValue();
}

private Double extractRating(Document doc) throws IOException {
    System.out.println(doc.getElementById("content-2-wide").getNodeName());
    return null;
}

}

The page I'm trying to scrape in this case is: Arrow

All IMDb pages have the same layout, so that isn't an issue. Do you know of an easy way to do this?

I'd suggest [JSoup](http://jsoup.org/), which apparently has a query-like language that is easier than XPath, or [Cobra](http://lobobrowser.org/cobra.jsp), which lets you treat the HTML as XML and therefore use XPath. I've had good success with Cobra, but you really need to understand the structure of the HTML page you are parsing. - MadProgrammer, Nov 27 '12 at 20:29
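
To illustrate the XPath route mentioned in the comment, here is a minimal sketch that uses only the JDK's javax.xml.xpath package on a plain org.w3c.dom.Document. It assumes the document is already well-formed XML (which is what tidy.setXmlOut(true) is meant to produce) and that its elements carry no namespace. The class name XPathByIdSketch, the helper names, the sample markup and the rating value are all hypothetical; the real IMDb page is structured differently. With the JDK's own parser, getElementById returns null when no DTD or schema declares "id" as an ID-typed attribute, so the sketch matches on the attribute value instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class XPathByIdSketch {

    // Hypothetical, simplified markup standing in for the tidied page.
    static final String SAMPLE =
        "<html><body>"
        + "<h1>Arrow</h1>"
        + "<div id=\"content-2-wide\"><div class=\"rating\"><span>7.5</span></div></div>"
        + "</body></html>";

    // Parses a well-formed XML string into a DOM Document (no DTD involved).
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Finds the first element whose "id" attribute equals the given value.
    // The id is passed as an XPath variable, so quotes in it cannot break the expression.
    static Element findById(Document doc, String id) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(name -> "id".equals(name.getLocalPart()) ? id : null);
        return (Element) xpath.evaluate("//*[@id=$id]", doc, XPathConstants.NODE);
    }

    // Returns the text of the first <span> nested anywhere inside the given div.
    static String ratingText(Document doc, String divId) throws Exception {
        Element div = findById(doc, divId);
        if (div == null) {
            return null;
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Evaluated relative to the div, so only its descendants are searched.
        return xpath.evaluate(".//span", div);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // No DTD declares "id" as an ID attribute here, so this lookup finds nothing.
        System.out.println(doc.getElementById("content-2-wide"));
        System.out.println(findById(doc, "content-2-wide").getNodeName());
        System.out.println(Double.parseDouble(ratingText(doc, "content-2-wide")));
    }
}
```

In the extractor above, extractRating could call something like ratingText(doc, "content-2-wide") and parse the result, once the path has been adjusted to the actual markup of the page.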
