
I'm trying to scrape reviews for a certain product on Amazon and export the result to CSV. I tried to embed the for loop within a function, but it kept failing, so I separated the function and the for loop to see the result. Now I don't know how to combine the results of the for loop from pages 1 to 10.

When I run the script, it prints the reviews page by page, but when I save the result to CSV, the file contains only the reviews from page 10.

How can I combine the results of the for loop and save them all to a single CSV?

#install.packages("tidyverse")
#install.packages("rvest")
#install.packages("xml2")

library(tidyverse)
library(rvest)
library(xml2)

#Product = LG OLED77C9PUB Alexa Built-in C9 Series 77" 4K Ultra HD Smart OLED TV (2019)
#ASIN = B07PQ98L9D

scrape_amazon <- function(ASIN, page_num){

  url_reviews <- paste0("https://www.amazon.com/LG-OLED77C9PUB-Alexa-Built-Ultra/product-reviews/", ASIN, "/?pageNumber=", page_num)
  doc <- read_html(url_reviews)

  # Review date
  doc %>%
    html_nodes("[data-hook='review-date']") %>%
    html_text() -> review_date

  # Review title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a tibble
  tibble(review_date,
         review_title,
         review_text,
         review_star,
         page = page_num)

}


for (i in 1:10){
  review_all <- scrape_amazon(ASIN = "B07PQ98L9D", page_num = i) %>%
    print(review_all)
}


# Save to CSV
write.table(review_all, file = "C:/Users/path/review.csv")
howruhj

2 Answers


We can use map_df from purrr to get the data for all 10 pages:

library(rvest)
final <- purrr::map_df(1:10, ~scrape_amazon(ASIN = "B07PQ98L9D", page_num = .x))

The issue with the for loop is that every iteration overwrites the previous one, hence you get data only for the last page. We can instead create a list to store the data from all the pages:

review_all <- vector("list", length = 10)
for (i in 1:10){
  review_all[[i]] <- scrape_amazon(ASIN = "B07PQ98L9D", page_num = i)
}
final <- do.call(rbind, review_all)

We can use write.csv to write the data to a CSV file:

write.csv(final, "C:/Users/path/review.csv", row.names = FALSE)
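
Since the tidyverse is already attached in the question, readr's write_csv() would be an equivalent alternative here (same placeholder path as above):

readr::write_csv(final, "C:/Users/path/review.csv")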
Ronak Shah

Your review_all variable is overwritten at each iteration of the for loop. At i = 1, review_all holds the data of page 1, and since your print command is also inside the loop, it prints that result. But when you move to the next iteration, review_all is replaced with the data of page 2. So in the end, review_all only holds the data of page 10, which is exactly what you see when you write to CSV. A minimal sketch of the effect is shown below.
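
Here is a tiny illustration of the same overwrite with plain numbers (hypothetical values, no scraping involved):

x <- 0
for (i in 1:10) {
  x <- i   # x is replaced on every pass; nothing accumulates
}
x
#> [1] 10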

Something like the following could help when working with for loops in general. You create a collector variable (in the following case, result), whose job is to store the result of each iteration of the for loop.

result = vector('list', 10)
for(i in 1:10){
  sq = i^2
  cube = i^3
  quad = i^4
  result[[i]] = c(sq, cube, quad)
}

# Converting to a data frame
result <- do.call(rbind, result) %>%
  magrittr::set_colnames(c('sq', 'cube', 'quad')) %>%
  as_tibble()
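
Applied to the question's scraper, the same collector pattern would look roughly like this (a sketch assuming the scrape_amazon() function defined above; dplyr::bind_rows() is used here instead of do.call(rbind, ...), and either works for a list of tibbles):

review_all <- vector("list", length = 10)
for (i in 1:10) {
  # Store each page's tibble in its own list slot instead of overwriting
  review_all[[i]] <- scrape_amazon(ASIN = "B07PQ98L9D", page_num = i)
}
final <- dplyr::bind_rows(review_all)
write.csv(final, "C:/Users/path/review.csv", row.names = FALSE)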
monte