I need to scrape product information from an e-commerce page, but the page uses infinite scrolling. Currently I am able to scrape only the products that are shown without scrolling down. Below is the code for it.

library(RCurl)
library(XML)
library(dplyr)
library(stringr)

# download the raw HTML of the category page
webpage <- getURL("http://www.jabong.com/kids/clothing/girls-clothing/kids-tops-t-shirts/?source=topnav_kids")

# extract every href value, then keep only the unique product links (they contain "?pos=")
linklist <- str_extract_all(webpage, '(?<=href=")[^"]+')[[1]]
linklist <- data.frame(linklist = linklist, stringsAsFactors = FALSE)
linklist <- filter(linklist, grepl("\\?pos=", linklist))
linklist <- unique(linklist)

# prefix each relative link with the site name
a <- data.frame(Links = paste0("Jabong.com", linklist$linklist), stringsAsFactors = FALSE)
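For reference, the extraction step above can be condensed into one base R function. This is only a sketch: the function name `extract_product_links` and the `base_url` default are illustrative assumptions, and it assumes product links are relative paths containing "?pos=", as the script above does.

```r
# Hypothetical helper: pull product links out of raw HTML with base R only.
# Assumption: product links are relative hrefs that contain "?pos=".
extract_product_links <- function(html, base_url = "http://www.jabong.com") {
  # every href="..." value in the page
  hrefs <- regmatches(html, gregexpr('(?<=href=")[^"]+', html, perl = TRUE))[[1]]
  # keep unique product links only
  links <- unique(hrefs[grepl("?pos=", hrefs, fixed = TRUE)])
  # paste0() would recycle an empty vector to "", so return early
  if (length(links) == 0) {
    return(character(0))
  }
  paste0(base_url, links)
}

# small inline example instead of a live page
sample_html <- paste0(
  '<a href="/top-1.html?pos=1">A</a>',
  '<a href="/about">B</a>',
  '<a href="/top-1.html?pos=1">A again</a>',
  '<a href="/top-2.html?pos=2">C</a>'
)
product_links <- extract_product_links(sample_html)
```

Like the original script, this only sees links present in the HTML that was downloaded, so it does not by itself solve the infinite-scrolling problem.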
  • I need the links of every product available on the page. The above script gives only the top 52 product links, but I need all of them, and Jabong uses infinite scrolling. Maybe RSelenium can help, but I have not been able to use it. – Nitin Kansal Apr 22 '16 at 06:23

1 Answer

Well, if scrolling is truly infinite, then it is impossible to get ALL of the links... If you are willing to settle for a finite number, you can indeed fruitfully use RSelenium here.

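library(RSelenium)
library(magrittr)  # provides the %>% pipe used further down

# start the Selenium server and open a browser session
# (checkForServer() and startServer() are defunct in newer RSelenium releases;
#  there, rD <- rsDriver() followed by remDr <- rD$client takes the place of these four lines)
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()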

# load your page
remDr$navigate("http://www.jabong.com/kids/clothing/girls-clothing/kids-tops-t-shirts/?source=topnav_kids")

# scroll down 5 times, allowing 3 seconds for the page to load each time
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page html
page_source<-remDr$getPageSource()

# get the URLs that you are looking for (anchors without the attribute yield NA and are dropped)
pp <- xml2::read_html(page_source[[1]]) %>% 
  rvest::html_nodes("a") %>% 
  rvest::html_attr("data-original-href") %>% 
  {.[!is.na(.)]}

The result is 312 links (in my browser). The more you have RSelenium scroll down, the more links you'll get.
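Rather than fixing the number of scrolls up front, one option is to keep scrolling until the link count stops growing. The following is a minimal sketch of that idea, not part of the original answer: `scroll_until_stable` and its parameters are hypothetical names, and the browser actions are passed in as functions so the loop itself can be tried without a Selenium session.

```r
# Hypothetical helper: call scroll_once() repeatedly until count_items()
# stops increasing, or until max_rounds is reached.
scroll_until_stable <- function(scroll_once, count_items,
                                max_rounds = 50, wait = 3) {
  previous <- count_items()
  for (round in seq_len(max_rounds)) {
    scroll_once()
    Sys.sleep(wait)              # give the page time to load new items
    current <- count_items()
    if (current <= previous) {   # nothing new appeared: assume we are done
      break
    }
    previous <- current
  }
  previous
}

# Dry run with a fake page that adds 52 links per scroll, up to 312
fake_page <- local({
  n <- 0
  list(
    scroll = function() n <<- min(n + 52, 312),
    count  = function() n
  )
})
total <- scroll_until_stable(fake_page$scroll, fake_page$count, wait = 0)

# Assumed wiring to the RSelenium session above (not run here):
# total <- scroll_until_stable(
#   scroll_once = function() remDr$executeScript(
#     "window.scrollTo(0, document.body.scrollHeight);"),
#   count_items = function() length(remDr$findElements(
#     using = "css selector", value = "a[data-original-href]"))
# )
```

If the site stops auto-loading and shows a button instead, `scroll_once` could also look for that button with `remDr$findElements()` and call `clickElement()` on it when present. The selector for such a button would have to be read from the page itself; none is assumed here.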

Peter Verbeet
  • The code is not working for me. remDr$open() gives the error: Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts))... – Nitin Kansal Apr 25 '16 at 06:30
  • Do you have the `RSelenium` essentials, i.e. 1) the Selenium jar file and 2) chromedriver.exe or the Firefox driver? Make sure you can run the code at these links first before trying the posted solution: [link1](http://stackoverflow.com/questions/31124702/rselenium-unknownerror-java-lang-illegalstateexception-with-google-chrome) and [link2](http://johndharrison.blogspot.in/2014/03/rselenium-package.html) – Silence Dogood Apr 25 '16 at 11:55
  • @Peter: The above script worked for 11 scrolls only. After that there is a "Show More Products" button that needs to be clicked in order to scroll down. What should we add to the present script in order to scroll further? – Nitin Kansal Apr 28 '16 at 09:07