
I am trying to scrape data from the web using the asynchronous approach mentioned in this post. Here are the URLs that I want to scrape data from. I store the URLs in a list.Rdata file. The links can be downloaded from here: https://www.dropbox.com/s/wl2per5npuq5h8y/list.Rdata?dl=1.


To begin with, I load the first 1000 URLs:

library(RCurl)
library(rvest)
library(XML)
library(httr)
library(reshape2)
library(reshape)

load("list.Rdata")    # loads the object `list` containing all URLs
list <- list[1:1000]  # keep only the first 1000 for now
un <- unlist(list)    # flatten to a character vector of URLs

Then I use the following function to scrape the content from those URLs:

get.asynch <- function(urls){
  txt <- getURIAsynchronous(urls)
  doc <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")
  base <- xpathSApply(doc, "//table//tr//td", xmlValue)
  # Pavadinimas (company name)
  uab <- ifelse(length(xpathSApply(doc, "//head//title", xmlValue)) == 1,
                gsub(". Rekvizitai.lt", "", xpathSApply(doc, "//head//title", xmlValue)), "-")
  # Imones kodas (company code)
  ik <- ifelse(is.na(agrep("Imones kodas", base)), "-", base[agrep("Imones kodas", base) + 1])
  # PVM kodas (VAT code)
  pk <- ifelse(is.na(match("PVM kodas", base)), "-", base[match("PVM kodas", base) + 1])
  # Vadovas (manager)
  vad <- ifelse(is.na(match("Vadovas", base)), "-", base[match("Vadovas", base) + 1])
  # Adresas (address)
  ad <- ifelse(is.na(match("Adresas", base)), "-", base[match("Adresas", base) + 1])
  # Telefonas (telephone; rendered as an image, so build the image URL from the src attribute)
  tel <- ifelse(is.na(match("Telefonas", base)), "-",
                paste("http://rekvizitai.vz.lt", xpathSApply(doc, "//table//tr//td//@src")[1], sep = ""))
  # Mobilusis (mobile phone)
  mob <- ifelse(is.na(match("Mobilusis", base)), "-",
                paste("http://rekvizitai.vz.lt", xpathSApply(doc, "//table//tr//td//@src")[2], sep = ""))
  # Tinklalapis (website)
  url <- ifelse(is.na(match("Tinklalapis", base)), "-", gsub("\t", "", base[match("Tinklalapis", base) + 1]))
  # Skype
  sk <- ifelse(is.na(match("Skype", base)), "-", base[match("Skype", base) + 1])
  # Bankas (bank)
  bnk <- ifelse(is.na(match("Bankas", base)), "-", base[match("Bankas", base) + 1])
  # Atsiskaitomoji saskaita (bank account)
  ats <- ifelse(is.na(match("Atsiskaitomoji saskaita", base)), "-", base[match("Atsiskaitomoji saskaita", base) + 1])
  # Darbo laikas (working hours)
  dl <- ifelse(is.na(match("Darbo laikas", base)), "-", base[match("Darbo laikas", base) + 1])
  # Darbuotojai (number of employees)
  drb <- ifelse(is.na(match("Darbuotojai", base)), "-", gsub("\\D", "", base[match("Darbuotojai", base) + 1]))
  # SD draudejo kodas (social insurance policyholder code)
  sd <- ifelse(is.na(match("SD draudejo kodas", base)), "-", base[match("SD draudejo kodas", base) + 1])
  # Apyvarta (be PVM) (turnover excluding VAT)
  apv <- ifelse(is.na(match("Apyvarta (be PVM)", base)), "-", base[match("Apyvarta (be PVM)", base) + 1])
  # Transportas (vehicles)
  trn <- ifelse(is.na(match("Transportas", base)), "-", base[match("Transportas", base) + 1])
  # Ivertinimas (rating)
  iv <- ifelse(length(xpathSApply(doc, "//span[@class='average']", xmlValue)) != 0,
               xpathSApply(doc, "//span[@class='average']", xmlValue), "-")
  # Vertintoju skaicius (number of ratings)
  vert <- ifelse(length(xpathSApply(doc, "//span[@class='votes']", xmlValue)) != 0,
                 xpathSApply(doc, "//span[@class='votes']", xmlValue), "-")
  # Veiklos sritys (fields of activity)
  veikl <- xpathSApply(doc, "//div[@class='floatLeft about']//a | //div[@class='floatLeft about half']//a | //div[@class='about floatLeft']//a",
                       xmlValue)[1]
  # Lentele (one row of the result table)
  df <- cbind(uab, ik, pk, vad, ad, tel, mob, url, sk, bnk, ats, dl, drb, sd, apv, trn, iv, vert, veikl)
  df
}

Next, I use my function to parse the content and get an error. I'm pretty sure this error is the result of heavy requests to the server.

> system.time(table <- do.call(rbind,lapply(un,get.asynch)))
Error in which(value == defs) : 
  argument "code" is missing, with no default
Timing stopped at: 0.89 0.03 6.82

I'm looking for a solution to avoid such behavior. I tried the Sys.sleep() function, although the result is the same. Any help on how to overcome the connection problems with the server would be welcome.
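
For reference, this is roughly how I added the pause, in case it matters (just a sketch of the attempt; the wrapper name get.asynch.slow and the one-second delay are only for illustration):

# sketch of the Sys.sleep() attempt -- the 1-second pause is arbitrary
get.asynch.slow <- function(url){
  Sys.sleep(1)               # pause before each request
  get.asynch(url)
}
table <- do.call(rbind, lapply(un, get.asynch.slow))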

  • Parallelising web requests is rude, because you're hammering someone's server. – hadley Feb 18 '15 at 03:40
  • Thanks for the reply. I noticed that; that's why I'm searching for an alternative solution to avoid such behavior. The approach where each URL is parsed sequentially, one by one, with a specific time gap worked, although it is inefficient and time-consuming. Any idea on how to improve the algorithm using a parallelization approach would be highly appreciated. – Aleksandr Feb 18 '15 at 08:50

1 Answer


I searched for a few minutes and found the answer here (second reply): R getURL() returning empty string.

You need to use

txt <- getURIAsynchronous(un, .opts = curlOptions(followlocation = TRUE))

There is also another problem: you don't actually do it asynchronously. With lapply(un, get.asynch) you send the URLs to get.asynch one by one. To do it in parallel you would need something like get.asynch(un), but then you'd have to rewrite the rest of the code. I would split it into two parts: curling

txts <- getURIAsynchronous(un, .opts=curlOptions(followlocation = TRUE))

and parsing

parse <- function(txt) { 
    doc <- htmlParse(txt,asText=TRUE,encoding = "UTF-8")
    base <- xpathSApply(doc, "//table//tr//td",xmlValue)
    ...
}
table <- do.call(rbind, lapply(txts, parse))

Curling worked fine for me, at least for the first 100 links. I didn't test the parsing part, though.
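
If the server chokes on the full vector, one thing you could try (just a sketch, I haven't tested it; the batch size of 100 and the 5-second pause are guesses) is to fetch the URLs in smaller batches with a pause in between:

# untested sketch: fetch in batches of 100 URLs with a pause between batches
batches <- split(un, ceiling(seq_along(un) / 100))
txts <- unlist(lapply(batches, function(b) {
  res <- getURIAsynchronous(b, .opts = curlOptions(followlocation = TRUE))
  Sys.sleep(5)  # give the server a break before the next batch
  res
}))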

  • Thanks for the reply. I tried the curling part; it worked only for a small sample size (number of URLs < ~100). The parsing function returns an empty list. I believe it is because of the multiple requests to the server. I also tried to navigate to the site and click some URLs manually after the code execution. As a result a captcha appeared, so I think the server rejects such heavy requests. This could explain why the parse function returns an empty table. – Aleksandr Feb 17 '15 at 19:14
  • It worked for longer URL vectors for me. Probably it depends somehow on server load or something. Maybe try to curl the pages one by one, with some break in between (for example `Sys.sleep(1 + runif(1)*4)`), but then it will take much more time. You must be patient :) – BartekCh Feb 19 '15 at 14:29
  • Well, there are ~140K URLs in total, so if I include Sys.sleep() it would take a couple of days to handle the task. Another solution on my "try" list is to use different proxy servers that change sequentially. – Aleksandr Feb 19 '15 at 16:57
  • Some improvements and a slightly different approach solved my problem: `curl <- getCurlHandle(); curlSetOpt(proxy='127.0.0.1:9150', proxytype=5, curl=curl); html <- getURL(url=base, curl=curl, .opts = list(ssl.verifypeer = FALSE), followlocation=TRUE); doc <- htmlParse(html, encoding = "UTF-8")` – Aleksandr Jul 02 '15 at 08:24
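
Spelled out, the proxy handle from the last comment combined with the random pause suggested earlier would look roughly like this (a sketch only: the SOCKS proxy at 127.0.0.1:9150 and the pause come from the comments above, while the loop over un and the tryCatch are additions for illustration):

# sketch: sequential download through a local SOCKS proxy with random pauses
curl <- getCurlHandle()
curlSetOpt(proxy = '127.0.0.1:9150', proxytype = 5, curl = curl)

pages <- lapply(un, function(u) {
  Sys.sleep(1 + runif(1) * 4)   # random 1-5 second break between requests
  tryCatch(
    getURL(url = u, curl = curl, .opts = list(ssl.verifypeer = FALSE),
           followlocation = TRUE),
    error = function(e) NA_character_   # skip URLs that fail instead of stopping
  )
})
docs <- lapply(pages[!is.na(pages)], htmlParse, asText = TRUE, encoding = "UTF-8")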