
I am using rvest to scrape Google News.
However, I encounter missing values in the element "Time" from time to time across different keywords. Because some values are missing, the scraped columns end up with different lengths, and building the data frame fails with a "differing number of rows" error.
Is there any way to fill in NA for these missing values?

Below is the example of the code I am using.

library(rvest)
library(dplyr)

html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Liang Wu
  • I could also encounter shorter length in `Description`. I think the issues is when you have unequal length you don't actually know which one is missing to have `NA` at the right place. You can always add `NA` at the start or end to make the length equal but that will not serve the purpose. – Ronak Shah Nov 27 '20 at 04:16
  • Does this answer your question: https://stackoverflow.com/questions/63540089/how-to-get-rid-of-the-error-while-scraping-web-in-r? – Dave2e Nov 27 '20 at 04:40
  • @Dave2e I saw this post, but I am not sure what I should use as parent node to make this code work in google news case. – Liang Wu Nov 27 '20 at 05:08
  • Does this answer your question? [Google News in R](https://stackoverflow.com/questions/64843821/google-news-in-r) – ekoam Nov 27 '20 at 15:14

1 Answer


Without knowing the exact page you were looking at, I tried the Google News top-stories page.

Per the rvest documentation, html_node (without the s) always returns a value, even if that value is NA. Therefore, to keep the vectors the same length, one needs to find the common parent node for all of the desired data nodes, then parse the desired information out of each of those parent nodes.

Assuming the Title nodes are the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of Description nodes; this didn't work. I then tried two levels up with xml_parent() %>% xml_parent(), and this seems to work.
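To see why switching from html_nodes() to html_node() fixes the length mismatch, here is a minimal sketch using a toy document (the class and tag names are made up for illustration, not taken from Google News):

```r
library(rvest)

# Toy document: two items, but only the first has a <time> child
doc <- minimal_html('
  <div class="item"><h3>A</h3><time>1 hour ago</time></div>
  <div class="item"><h3>B</h3></div>
')

# html_nodes() silently drops items with no match: only 1 result
doc %>% html_nodes("time") %>% html_text()
# -> "1 hour ago"

# html_node() applied per parent keeps one slot per item,
# returning NA where the child is missing: 2 results
items <- doc %>% html_nodes(".item")
items %>% html_node("time") %>% html_text()
# -> "1 hour ago" NA
```

Because html_node() returns one result per parent, the Time and Description vectors stay aligned with Title, and NA marks exactly which articles lack that field.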

library(rvest)
library(xml2)

url <- "https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)

Title <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()

Link <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/", Link)

# Find the common parent node
# (trial and error: tried the parent, then the grandparent)
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()

# html_node() (singular) returns NA where the child is missing,
# so these vectors stay the same length as Title
Description <- Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time <- Titlenodes %>% html_node('.WW6dff') %>% html_text()

answer <- data.frame(Title, Time, Description, Link)
Dave2e
  • This solution does not account for [nested articles](https://i.stack.imgur.com/fZTVj.png). – ekoam Nov 27 '20 at 15:15