
I've noticed we don't have many questions here about Rcrawler, and I think it's a great tool for scraping websites. However, I'm having trouble getting it to scrape multiple websites; right now it can only handle three. Please let me know if anyone has experience with this issue. Thanks.

I've tried putting all the URLs in a list/vector, but it still doesn't work. Here is my scraping code to get the title, description, and keywords of each website.

Rcrawler(Website = c("http://www.amazon.com", "www.yahoo.com", "www.wsj.com"),
         no_cores = 3, no_conn = 3, MaxDepth = 0,
         ExtractXpathPat = c('/html/head/title',
                             '//meta[@name="description"]/@content',
                             '//meta[@name="keywords"]/@content'),
         PatternsName = c("Title", "Description", "Keywords"),
         saveOnDisk = FALSE)

If I have more than three websites, it gives me this error:

Error in Rcrawler(Website = c("http://www.amazon.com", "www.yahoo.com",  : 
  object 'getNewM' not found
cheklapkok

2 Answers


Something like this should work:

library(rvest)

# Vector of URLs to loop over
mylist <- c("http://www.amazon.com", "http://www.yahoo.com", "http://www.wsj.com")

# Fetch and parse each page in turn
for (i in mylist) {
  webpage <- read_html(i)
  print(webpage)
}

Or, load each page into a list and parse the list; finally, you may consider saving your results to a CSV (see the sketch below). Keep in mind that scraping many different websites will almost certainly produce very different results. I can certainly understand why a person would want to loop through different URLs of the same site, but I'm not sure what you gain by looping through different URLs of different sites.
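A minimal sketch of that idea, assuming rvest 1.0+ (for html_element() and html_text2()); the file name scrape_results.csv is just illustrative:

library(rvest)

urls <- c("http://www.amazon.com", "http://www.yahoo.com", "http://www.wsj.com")

# Parse each page once and keep the parsed documents in a list
pages <- lapply(urls, read_html)

# Pull the title and meta tags out of each stored document;
# html_element() returns a missing node (and the extractors NA) if a tag is absent
results <- data.frame(
  url         = urls,
  title       = sapply(pages, function(p) html_text2(html_element(p, "title"))),
  description = sapply(pages, function(p) html_attr(html_element(p, 'meta[name="description"]'), "content")),
  keywords    = sapply(pages, function(p) html_attr(html_element(p, 'meta[name="keywords"]'), "content"))
)

write.csv(results, "scrape_results.csv", row.names = FALSE)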

ASH

I'm not sure how well this works in theory, but you can try making repeated calls to Rcrawler, for example in a while loop:

library(Rcrawler)

a <- list()

# ads_txt_4_cat$gen_link is my own vector of URLs; substitute your own
Rcrawler_function <- function(no_conn, no_cores, MaxDepth, Obeyrobots, saveOnDisk, ExtractXpath) {
  x <- 1
  while (x < 5) {
    tryCatch(
      expr = {
        Rcrawler(ads_txt_4_cat$gen_link[x], no_conn = no_conn, no_cores = no_cores,
                 MaxDepth = MaxDepth, Obeyrobots = Obeyrobots,
                 saveOnDisk = saveOnDisk, ExtractXpathPat = ExtractXpath)
        # Rcrawler leaves its extraction results in DATA in the global
        # environment; copy them into the list before the next run overwrites them
        a[[x]] <<- DATA
      },
      error = function(e) { cat("ERROR :", conditionMessage(e), "\n") }
    )
    # Increment outside tryCatch so a failed crawl doesn't retry the same URL forever
    x <- x + 1
  }
}

Rcrawler_function(4, 4, 0, TRUE, FALSE, "//body")
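One note on the sketch above: each Rcrawler run replaces DATA (and INDEX) in the global environment, which is why the loop copies DATA into a before the next call. Afterwards you can inspect what was collected per site, for example:

length(a)                    # one element per crawled URL
str(a[[1]], max.level = 1)   # extraction results for the first URL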
APhillips
advance84