I'm trying to scrape an IMDB web page for data about web series. The problem is that once I convert the page to a DOM object, getting values out of it isn't as easy as it looks.
For instance, getElementsByTagName("h1") returns a single element, so I know exactly which value I'm getting (in this case the name of the show). But the show rating is buried in nested divs and is very hard to locate. So I tried calling getElementById with the id of the enclosing div, hoping to narrow the search down to that element.
But it returns null. Why is that, and what would be the easiest way to scrape a page like this?
Here's a code snippet:

    import java.io.IOException;
    import java.net.URL;
    import java.net.URLConnection;
    import org.w3c.dom.Document;
    import org.w3c.tidy.Tidy;

    public final class IMDBExtractor {

        private String imdbId;

        public IMDBExtractor(String imdbId) {
            this.imdbId = imdbId;
        }

        public synchronized TvShow extractTvShow() throws IOException {
            TvShow show = new TvShow();

            // Access the IMDB URL
            URL url = new URL("http://www.imdb.com/title/" + imdbId);
            URLConnection uc = url.openConnection();
            uc.addRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
            uc.connect();

            // Tidy up the HTML and parse it into a DOM document
            Tidy tidy = new Tidy();
            tidy.setXmlOut(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(uc.getInputStream(), null);

            // Set show attributes
            show.setImdbId(imdbId);
            show.setTitle(extractSeriesName(doc));
            show.setRating(extractRating(doc));
            return show;
        }

        private String extractSeriesName(Document doc) throws IOException {
            // The first text node of the first <h1> holds the show name
            return doc.getElementsByTagName("h1").item(0).getChildNodes().item(0).getNodeValue();
        }

        private Double extractRating(Document doc) throws IOException {
            // This is where it fails: getElementById returns null,
            // so calling getNodeName() on the result throws a NullPointerException
            System.out.println(doc.getElementById("content-2-wide").getNodeName());
            return null;
        }
    }

The page I'm trying to scrape in this case is: Arrow
All IMDB pages have the same structure, so that isn't an issue. Does anyone know an easy way to do this?
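
To make the getElementById problem easier to reproduce, here is a minimal sketch that does not need JTidy or a network connection. As far as I understand, the W3C DOM getElementById only matches attributes that are declared as type ID (through a DTD, a schema, or setIdAttribute), not attributes that merely happen to be named "id", which may be why I get null. The class name IdLookupSketch, the helper methods findById and parse, and the sample markup are all made up for illustration; the markup is not the real IMDB page, and I'm assuming the JTidy document behaves like the JDK parser used here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class IdLookupSketch {

    // Hypothetical helper: depth-first search for the first element
    // whose plain "id" attribute equals the given value.
    static Element findById(Node node, String id) {
        if (node instanceof Element) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent
            if (id.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = findById(child, id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Hypothetical helper: parse a well-formed XML string with the JDK parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up stand-in markup, not the actual IMDB page
        String xml = "<html><body><div id=\"content-2-wide\">"
                + "<span>sample rating</span></div></body></html>";
        Document doc = parse(xml);

        // No DTD declares "id" as type ID here, so the built-in lookup finds nothing
        System.out.println("getElementById: " + doc.getElementById("content-2-wide"));

        // Walking the tree and comparing the attribute by name does find the div
        Element div = findById(doc.getDocumentElement(), "content-2-wide");
        System.out.println("findById: " + div.getTagName() + " / " + div.getTextContent());
    }
}
```

If that is indeed the cause, walking the tree by hand like this would work as a fallback, but it still feels clumsy for deeply nested pages, so I'd welcome a cleaner approach.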