web scraping in imdb using R

Question

I want to find the links to the top 250 movies on IMDb. I looked for a common pattern in the HTML source code and found "chttp", but I am not sure it will get me anywhere. How can I find a pattern to construct the links from?

require("XML")
imdb = "http://www.imdb.com/chart/top?sort=ir,desc"
imdb.page = readLines(imdb)
# keep only the lines that contain "chttp"
g = grep(pattern = "chttp", x = imdb.page)
imdb.lines = imdb.page[g]

Here's an example output:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"

My main problem is working out the link (URL) for each of the top 250 movies based on the code I have already written; I don't know what the next step is. I am also not sure whether "chttp" is a good pattern to grep for.

So according to the results, starting from index 3, the movie titles are on the odd indices:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"
> imdb.lines[2]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0111161/?ref_=chttp_tt_1\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[3]
[1] "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>"
> imdb.lines[4]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0068646/?ref_=chttp_tt_2\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[5]
[1] "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
> imdb.lines[6]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0071562/?ref_=chttp_tt_3\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_SX34_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[7]
[1] "    <a href=\"/title/tt0071562/?ref_=chttp_tt_3\" title=\"Francis Ford Coppola (dir.), Al Pacino, Robert De Niro\" >The Godfather: Part II</a>"
> imdb.lines[9]
[1] "    <a href=\"/title/tt0468569/?ref_=chttp_tt_4\" title=\"Christopher Nolan (dir.), Christian Bale, Heath Ledger\" >The Dark Knight</a>"
> imdb.lines[10]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0110912/?ref_=chttp_tt_5\" ><img src=\"http://ia.media-imdb.com/images/M/MV5BMjE0ODk2NjczOV5BMl5BanBnXkFtZTYwNDQ0NDg4._V1_SY50_CR0,0,34,50_.jpg\" width=\"34\" height=\"50\" />"

This is pretty straightforward with `xpath`. For titles, try: `library(XML); tt <- htmlParse('http://www.imdb.com/chart/top?sort=ir,desc'); xpathSApply(tt, "//td[@class='titleColumn']//a", xmlValue)`. Also look at `xpathSApply(tt, "//td[@class='titleColumn']//a", xmlAttrs)`. — jbaums, Apr 01 '14 at 08:39
I liked this but I need the URLs not the movie names. Thanks. — Mona Jalal, Apr 01 '14 at 08:43
Yes, sorry - my speedreading of your question. I edited the above comment. The first row of the result of the second code block gives the urls. i.e. `xpathSApply(tt, "//td[@class='titleColumn']//a", xmlAttrs)[1, ]`. — jbaums, Apr 01 '14 at 08:44
See my previous comment. Just subset the matrix to the first row. Or `cbind` the titles to the transposed (i.e. `t()`) attributes matrix. — jbaums, Apr 01 '14 at 08:54

score 2 · Accepted Answer · answered Apr 01 '14 at 09:13

XPath makes jobs like this trivial.

library(XML)
tt <- htmlParse('http://www.imdb.com/chart/top?sort=ir,desc')
cbind(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlValue),
      t(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlAttrs)))

The first argument to `cbind` returns the titles (the text between the `a` tags) and the second returns the anchors' attributes (`href` and `title`; the latter here lists each film's director and lead actors).
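As a rough illustration of the same XPath approach, here is a minimal sketch that runs on an inline snippet instead of the live page. The two-row table is a made-up stand-in modelled on the markup quoted in the question, the `movies` data frame and `anchors` variable are names chosen for this example, and prefixing `http://www.imdb.com` assumes the relative `href` values resolve against the site root.

```r
library(XML)

# Hypothetical stand-in for the chart page, modelled on the markup in the question.
html <- '<table>
  <tr><td class="titleColumn">
    <a href="/title/tt0111161/?ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
  </td></tr>
  <tr><td class="titleColumn">
    <a href="/title/tt0068646/?ref_=chttp_tt_2" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
  </td></tr>
</table>'

# Parse the string rather than fetching a URL.
doc <- htmlParse(html, asText = TRUE)
anchors <- "//td[@class='titleColumn']//a"

# One row per film: link text and the href attribute of each anchor.
movies <- data.frame(
  title = xpathSApply(doc, anchors, xmlValue),
  href  = xpathSApply(doc, anchors, xmlGetAttr, "href"),
  stringsAsFactors = FALSE
)

# Assumption: relative hrefs resolve against the site root.
movies$url <- paste0("http://www.imdb.com", movies$href)
```

Pulling only `href` with `xmlGetAttr` avoids having to transpose the attribute matrix when the URLs are all that is needed.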

jbaums

just a quick question. How can I extract just something like `tt0468569` from the URL? – Mona Jalal Apr 01 '14 at 09:26
There are a couple of ways, but `sub` is pretty good for that: `sub('.*/title/(.*)/.*', '\\1', s)`, where `s` is a vector of strings that you want to do that for, e.g. the url column of the output of my solution. See [here](http://www.regular-expressions.info/) and [here](http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html) for more details on regular expressions. The `stringr` package has some convenience functions for this type of thing. – jbaums Apr 01 '14 at 09:32
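A small sketch of that `sub` call, using two `href` values copied from the output in the question. The second variant with `regmatches` is an alternative added here for comparison; it assumes every ID is `tt` followed by digits.

```r
# Sample hrefs copied from the output shown in the question.
s <- c("/title/tt0468569/?ref_=chttp_tt_4",
       "/title/tt0111161/?ref_=chttp_tt_1")

# Pattern from the comment: keep whatever sits between "/title/" and the last "/".
ids <- sub('.*/title/(.*)/.*', '\\1', s)

# Stricter variant (assumption: IDs are always "tt" followed by digits).
ids_strict <- regmatches(s, regexpr("tt[0-9]+", s))

print(ids)
```

Note that `sub` returns the input unchanged when the pattern does not match, so the stricter variant may be easier to check if some strings are not title links.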

marco · Answer 2 · 2014-04-01T09:10:29.057

1

What about using the alternative interfaces?

Edit #1: I have looked into some of the files and they don't seem to contain any links, or even the IMDb ID. There should be another way, though.

Edit #2: OK, there is no other way apparently, but somebody already did something. E.g. this guy; have a look.

edited Apr 01 '14 at 09:10

answered Apr 01 '14 at 09:00

marco

Thank you. I liked this answer! – Mona Jalal Apr 01 '14 at 09:14
