R: readLines on a URL leads to missing lines

Question

When I call readLines() on a URL, some lines or values come back missing. This might be due to spacing or encoding that R cannot interpret.

When you open the URL above in a browser, Ctrl+F finds 38 instances of text matching "TV-". On the other hand, when I run readLines() and grep("TV-", HTML), I only find 12.

So, how can I avoid encoding/spacing errors so that I get the complete lines of the HTML?

What information do you want to extract from the page? BTW, I could not find any instance of "TV-" on the page you linked using Ctrl+F. – Ronak Shah Oct 27 '20 at 04:14
@RonakShah Thank you. I am trying to pull the titles of all TV shows shot in Vancouver, Canada. The IMDb link should have several "TV-" strings such as TV-MA, TV-14, etc. I have partially working code that does this: first, I index where "TV-" occurs, then I take the title, which is 4 lines above. Unfortunately, readLines() seems to skip some lines or leave values blank, as if it doesn't know what it's reading. – 42Cosmic Oct 27 '20 at 05:06
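
For reference, here is a minimal sketch of the approach described in the question and the comment above. It runs on an invented character vector standing in for the output of readLines(); the sample lines are assumptions, not the real page. It also shows one thing worth ruling out when comparing counts: grep() returns one index per matching line, whereas a browser search counts every occurrence, so the two numbers need not agree even when no lines are lost. Whether that accounts for the gap seen here is not established.

```r
# Stand-in for the result of readLines(url); these lines are invented for illustration.
html <- c(
  "<h3>Show A</h3>",
  "",
  "",
  "",
  "<span>TV-14</span>",
  "<span>TV-MA</span><span>TV-PG</span>"
)

# grep() returns the indices of matching lines, so a line counts once
# even if it contains several matches.
hits <- grep("TV-", html, fixed = TRUE)
n_lines <- length(hits)

# Count every occurrence, which is closer to what a browser's Ctrl+F reports.
matches <- gregexpr("TV-", html, fixed = TRUE)
n_occurrences <- sum(lengths(regmatches(html, matches)))

# Title lookup by fixed offset, as described in the comment:
# the title is assumed to sit 4 lines above the first hit.
title_line <- html[hits[1] - 4]
```

A fixed line offset depends on the exact line layout of the served HTML, so it is fragile if that layout varies between requests.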

score 0 · Accepted Answer · answered Oct 27 '20 at 05:11

You can use rvest to scrape the data. For example, to get all the titles, you can do:

library(rvest)

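# IMDb title search restricted to the Vancouver filming location
url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'

# Parse the page, select each title link, and keep its text
all_titles <- url %>%
  read_html() %>%
  html_nodes('div.lister-item-content h3 a') %>%
  html_text()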

all_titles

# [1] "The Haunting of Bly Manor"               "The Haunting of Hill House"             
# [3] "Supernatural"                            "Helstrom"                               
# [5] "The 100"                                 "Lucifer"                                
# [7] "Criminal Minds"                          "Fear the Walking Dead"                  
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"   
#...                 
#...

Ronak Shah

Thank you. This works to get all titles. I'd need to run html_nodes() again and append a new column containing the "TV-" rating, or whatever is inside the CSS selector. Lastly, I'd just filter out the rows that don't contain "TV-". Follow-up question: why does html_nodes('span.certificate') occasionally show something different from what is displayed on the webpage? For example, the webpage displays "TV-14", but html_nodes() outputs "14+". – 42Cosmic Oct 27 '20 at 06:13
That is weird. I am not sure why that happens. Did you use `html_text()` to extract the text from it? – Ronak Shah Oct 27 '20 at 06:59
Yes, I used html_text() as well. Further up the pipeline, html_nodes() also returns different values. Cross-referencing with the raw HTML from Chrome's 'view page source', they differ. For instance, the TV show Supernatural is TV-14, but read_html() sees it as 'PG'. Is this an encoding problem? These are totally different film certifications. – 42Cosmic Oct 27 '20 at 07:28
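
The plan described in the first comment above (titles plus a certificate column, then filtering on "TV-") could be sketched roughly as follows. The HTML fragment is hypothetical and only imitates the structure implied by the selectors used in this thread; the real page may differ, and this does not address why the served certificates differ from the browser view.

```r
library(rvest)

# Hypothetical HTML imitating the structure implied by the selectors
# 'div.lister-item-content h3 a' and 'span.certificate'; not the real page.
page <- read_html('<html><body>
  <div class="lister-item-content">
    <h3><a href="#">Show A</a></h3>
    <span class="certificate">TV-14</span>
  </div>
  <div class="lister-item-content">
    <h3><a href="#">Movie B</a></h3>
    <span class="certificate">PG</span>
  </div>
  <div class="lister-item-content">
    <h3><a href="#">Show C</a></h3>
  </div>
</body></html>')

# One node per listed item
items <- html_nodes(page, "div.lister-item-content")

# html_node() (singular) returns exactly one result per item, and NA text
# when the element is missing, so the two columns stay aligned.
shows <- data.frame(
  title = html_text(html_node(items, "h3 a")),
  certificate = html_text(html_node(items, "span.certificate")),
  stringsAsFactors = FALSE
)

# Keep only rows whose certificate contains "TV-"; NA certificates are dropped.
tv_shows <- shows[grepl("TV-", shows$certificate, fixed = TRUE), ]
```

Extracting both fields per item, rather than with two independent html_nodes() calls over the whole page, avoids shifted rows when an item has no certificate.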