How can we scrape missing values from IMDB in R?

Question

library(rvest)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)

The code above scrapes the titles and ratings of 50 movies. I want movies without a rating to be included as NAs.

However, that doesn't happen, because IMDb only emits the rating element for movies that actually have a rating (I used SelectorGadget to find the selectors). So the observation count is 50 for titles and just 33 for ratings, which is not what I want. I have tried using html_node() along with html_nodes(), but R gives an error saying css and xpath cannot be used together. I have also tried trim = TRUE and replace(!nzchar(.), NA), but neither works.

Is there a way to solve this and ensure I get 50 ratings (including NAs or empty values)?

score 1 · Accepted Answer · answered Dec 24 '21 at 17:45

You need to do this parsing in two steps. First, collect the parent nodes for all 50 movies with html_nodes(). Then parse that collection with html_node() (without the s), which returns exactly one result per parent node and gives NA for movies where the child node is missing.

library(rvest)
library(dplyr)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")

# get the parent node of each movie
movies <- imdb_page %>% html_elements("div.lister-item")

#now parse each movie node for the desired subnode
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()

Note the update from html_node(s) to html_element(s), the current naming in rvest 1.0.
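The two-step idea can be tried offline with a small sketch. The HTML below is a hypothetical stand-in for the IMDb listing (only the class names are taken from the code above), and the second movie deliberately has no rating element.

```r
library(rvest)

# Hypothetical markup, not the real IMDb page: "Movie B" has no rating block.
page <- minimal_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie A</a></h3>
    <div class="ratings-imdb-rating"><strong>7.4</strong></div>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie B</a></h3>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><a>Movie C</a></h3>
    <div class="ratings-imdb-rating"><strong>5.1</strong></div>
  </div>
')

# Flat selection: only the ratings that exist are returned, so alignment is lost.
flat <- page %>% html_elements(".ratings-imdb-rating strong") %>% html_text()

# Step 1: one parent node per movie.
movies <- page %>% html_elements("div.lister-item")

# Step 2: one child per parent; a missing child becomes NA.
title  <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()

result <- data.frame(title = title, rating = as.numeric(rating))
result
```

Here `flat` has two entries while `result` has three rows, with NA in the rating column for the movie that has no rating.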

Thanks for this. Will this also work similarly for other categories like genre or gross box office if they are missing? — siddharth varada, Dec 24 '21 at 18:05
Yes, will work for any child node under the parent node. So release year, genre, runtime, director, etc. should all work. The only time this will not work is if there are multiple sub nodes with the same tag. (please consider accepting this answer to close the question) — Dave2e, Dec 24 '21 at 18:18
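For the case mentioned in the comment above, where a parent holds several sub nodes matching the same selector, one possible approach is to loop over the parents and collapse the matches into a single string. This is only a sketch: the `.genre` markup below is hypothetical and is not claimed to match the real IMDb page.

```r
library(rvest)

# Hypothetical markup: zero, one, or several .genre children per movie.
page <- minimal_html('
  <div class="lister-item"><span class="genre">Drama</span><span class="genre">Horror</span></div>
  <div class="lister-item"></div>
  <div class="lister-item"><span class="genre">Comedy</span></div>
')

movies <- page %>% html_elements("div.lister-item")

# For each parent, gather all matching children and paste them together;
# a parent with no match yields NA so the result stays one entry per movie.
genres <- vapply(seq_along(movies), function(i) {
  found <- movies[[i]] %>% html_elements(".genre") %>% html_text()
  if (length(found) == 0) NA_character_ else paste(found, collapse = ", ")
}, character(1))

genres
```

The result keeps one entry per movie, so it can sit next to the title and rating columns in a data frame.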

Nad Pat · Answer 2 · 2021-12-24T17:30:32.597

We can use the .ratings-user-rating selector to get the whole list of ratings:

library(rvest)
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv" 

df <- url %>% read_html() %>% html_nodes('.ratings-user-rating') %>% html_text2()
df

 [1] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X "
 [4] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X "
 [7] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[10] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[13] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.5/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "  
[16] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.7/10 X "
[19] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[22] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.7/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[25] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.3/10 X "
[28] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[31] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[34] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 8.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X "
[37] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X "
[40] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.2/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[43] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[46] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X "
[49] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 1.5/10 X "

We further need to clean the data to get the ratings.

library(stringr)
df <- df %>% gsub(".*9 10", "", .) %>% str_sub(start = 1, end = -7) %>% str_replace_all("-", replacement = NA_character_)

 [1] NA     NA     " 3.6" NA     " 4.9" " 4.1" " 7.4" " 4.6" NA     " 7.9" " 3.3" NA     " 6.5" " 6.6" " 5"   " 3.6" NA     " 4.7" " 3.1" " 5.4" NA     " 5.7"
[23] " 5.1" NA     NA     " 6.9" " 4.3" " 6.6" NA     NA     " 4.1" " 4.6" NA     " 5.1" " 8.3" " 7.1" " 5.8" " 3.4" " 3.3" " 3.2" NA     NA     NA     " 4.6"
[45] NA     " 6.8" NA     " 6.6" " 4.4" " 1.5"
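As an alternative to trimming by position, the number in front of "/10" can be captured with a regular expression and converted to numeric in one go. This is a sketch on three sample strings copied from the output above; the variable names `raw` and `ratings` are made up for illustration.

```r
library(stringr)

# Three sample strings in the same shape as the scraped output.
raw <- c(
  "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X ",
  "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X ",
  "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "
)

# Capture the number directly before "/10"; unrated rows ("-/10") do not
# match, so str_match() returns NA for them.
ratings <- as.numeric(str_match(raw, "([0-9.]+)/10")[, 2])
ratings
```

This avoids the leading space left by the position-based trimming and yields a numeric vector rather than character strings.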

Get the movie names:

movie <- url %>% read_html() %>% html_nodes(".lister-item-header a") %>% html_text()

data.frame(Movie = movie, Ratings = df)
                                                    Movie Ratings
1                                              #1915House    <NA>
2                                              #Bodygoals    <NA>
3                                               #Followme     3.6
4                                             #FullMethod    <NA>
5                                                   #Like     4.9
6                                             #SquadGoals     4.1

Can you explain what you did? How did you get the '.ratings-user-rating' tag? Also how does html_text2() work? I am unable to put the ratings into a data frame. This is printing it directly into the console. — siddharth varada, Dec 24 '21 at 17:20
Modified my answer. `html_text2()` returns the text roughly as a browser would render it, inserting line breaks such as `\n`, which makes the strings easier to work with. — Nad Pat, Dec 24 '21 at 17:31
I am unable to clean the data. Here are the errors: df %>% gsub(".*9 10", "", .) %>% str_sub(start=1, end=-7) %>% str_replace_all('-', replacement = NA_character_) Error in as.character(x) : cannot coerce type 'closure' to vector of type 'character' Even the data frame doesn't work: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘"function"’ to a data.frame — siddharth varada, Dec 24 '21 at 17:51
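The 'closure' errors quoted in the comment above are what R reports when `df` has never been assigned: the name then resolves to the built-in F-distribution density function `stats::df`. A minimal sketch reproducing this, where `fresh_df` and `err` are illustrative names:

```r
# With no user object called `df`, the name refers to stats::df (a function).
fresh_df <- stats::df
is.function(fresh_df)

# Passing a function to gsub() fails when it tries to coerce it to character;
# capture the error message instead of stopping.
err <- tryCatch(
  gsub(".*9 10", "", fresh_df),
  error = function(e) conditionMessage(e)
)
err
```

Storing the scraped strings in `df` before running the cleaning step avoids both errors.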
