
I am using rvest to scrape a few pieces of information from a website. An example page is https://www.edsurge.com/product-reviews/mr-elmer-product/educator-reviews, and I wrote a function like this:

library(rvest)
library(purrr)

parse_review_page <- function(url) {
    # read the page once and reuse the parsed document
    page <- read_html(url)

    product_name2 <- page %>%
            html_nodes(".mb0 a") %>%
            html_text()
    review <- page %>%
            html_nodes(".review-ratings__text strong") %>%
            html_text()
    usage <- page %>%
            html_nodes("p:nth-child(3)") %>%
            html_text()

    # one product name per page: repeat it so every review row has one
    data.frame(PRODUCT_NAME2 = rep(product_name2, length(review)),
               REVIEW = review,
               USAGE = usage,
               stringsAsFactors = FALSE)
}

and I used this to put the result into a dataframe:

url_to_scrape <- c("https://www.edsurge.com/product-reviews/mr-elmer- 
product/educator-reviews")

DF6 <- url_to_scrape %>% map_dfr(parse_review_page)

But the problem I'm encountering is that while there are 100+ user reviews, the webpage only shows the first 30. What makes this more challenging is that the URL doesn't change after clicking 'Load More' at the bottom of the page, so there is essentially no 2nd or 3rd page to scrape. Can anyone suggest how to resolve this so I can scrape all the review data with the function I created?

Edward Lin
    If the paging is done via javascript, you cannot use `rvest`. Try [`RSelenium`](https://github.com/ropensci/RSelenium) (currently not on CRAN). – r2evans Jul 15 '18 at 00:59
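As the comment suggests, one route is to drive a real browser with RSelenium, click 'Load More' until everything is rendered, and then hand the full page to rvest. Below is a minimal sketch; the ".load-more" CSS selector is a placeholder, and you would need to inspect the live page to find the real one:

library(RSelenium)
library(rvest)

# start a local browser session (requires a working Selenium setup)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

remDr$navigate("https://www.edsurge.com/product-reviews/mr-elmer-product/educator-reviews")

# keep clicking 'Load More' until the button disappears;
# the ".load-more" selector is hypothetical
repeat {
    btn <- tryCatch(
        remDr$findElement(using = "css selector", ".load-more"),
        error = function(e) NULL
    )
    if (is.null(btn)) break
    btn$clickElement()
    Sys.sleep(2)  # give the newly loaded reviews time to render
}

# the fully rendered page can now be parsed with rvest
full_page <- read_html(remDr$getPageSource()[[1]])

# when done: remDr$close(); driver$server$stop()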

1 Answer


Here is sample code that uses HTTP requests to read the next few pages of reviews:

library(httr)
library(xml2)
library(magrittr)

url <- "https://www.edsurge.com/product-reviews/mr-elmer-product/educator-reviews"

# fetch the main page to extract the CSRF token the site's AJAX calls require
elmer <- GET(url)
csrf_token <- read_html(rawToChar(elmer$content)) %>%
    xml_find_first(".//meta[@name='csrf-token']") %>%
    xml_attr("content")

# request the review feed page by page, mimicking the browser's XHR calls,
# and save each page of reviews as a local HTML file
for (n in 1:5) {
    resp <- GET(paste0(url, "/feed?page=", n, "&per_page_count=30"),
        add_headers("X-CSRF-Token" = csrf_token,
                    "X-Requested-With" = "XMLHttpRequest"))

    if (status_code(resp) == 200)
        write_html(read_html(rawToChar(resp$content)), paste0(n, ".html"))
}
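Since read_html() also accepts local file paths, you can then point the parsing function from the question at the saved files and bind the results with map_dfr. Note that the page count above is hard-coded to 5; you may want to keep requesting until a page comes back empty. This is a sketch, and it assumes the feed markup matches the selectors used in parse_review_page:

library(purrr)

# parse every saved feed page and stack the rows into one data frame
saved_pages <- list.files(pattern = "^[0-9]+\\.html$")
DF_ALL <- map_dfr(saved_pages, parse_review_page)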
chinsoon12