
I have a large data.frame with 20 million rows. The data frame is not purely numeric; some columns are character. Using a split-and-conquer approach, I want to split this data frame so the parts can be processed in parallel with the snow package (specifically, with the parLapply function). The problem is that the nodes run out of memory because the data frame parts are worked on in RAM. I looked for a package to help with this and found only one that handles a mixed-type data.frame: the ff package. But using it brings another problem: the result of splitting an ffdf is not the same as splitting a common data.frame, so it is not possible to run parLapply on it.

Do you know of other packages for this purpose? bigmemory only supports matrices.
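
For reference, here is a minimal sketch of the split-and-parLapply workflow described above, using only base R and snow; big.df and process_chunk() are hypothetical placeholders, not code from this question:

library(snow)

n.cores <- 4
cl <- makeCluster(n.cores, type = "SOCK")

# split the data frame into one chunk of rows per core; each node only
# receives its own chunk, but that chunk still has to fit in the node's RAM,
# which is the memory problem described above
chunks <- split(big.df,
                cut(seq_len(nrow(big.df)), n.cores, labels = FALSE))

res <- parLapply(cl, chunks, process_chunk)

stopCluster(cl)

Increasing the number of chunks beyond the number of cores reduces the size of each piece sent to a node.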

1 Answer


I've benchmarked several ways of splitting a data frame and parallelising the work to see how well they cope with large data frames. This may help you handle your 20M-row data frame without needing another package.

The benchmark results are shown in the timing plot below; the options compared are described after the code.

[benchmark timing plot]

This suggests that, for large data frames, the best option is the following (not quite the fastest, but it has a progress bar):

library(doSNOW)
library(itertools)
library(parallel)   # provides makePSOCKcluster()
library(tictoc)     # provides tic()/toc() for timing

# if the chunks on each core exceed available memory, increase the chunk factor
chunk.factor <- 1 
chunk.num <- kNoCores * chunk.factor
tic()
# init the cluster
cl <- makePSOCKcluster(kNoCores)
registerDoSNOW(cl)
# init the progress bar
pb <- txtProgressBar(max = chunk.num, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
# conduct the parallelisation
travel.queries <- foreach(m=isplitRows(coord.table, chunks=chunk.num),
                          .combine='cbind',
                          .packages=c('httr','data.table'),
                          .export=c("QueryOSRM_dopar", "GetSingleTravelInfo"), 
                          .options.snow = opts) %dopar% {
                            QueryOSRM_dopar(m,osrm.url,int.results.file)
                          }
# close progress bar
close(pb)
# stop cluster
stopCluster(cl) 
toc()

Note that

  • coord.table is the data frame/table
  • kNoCores (= 25 in this case) is the number of cores

The options compared in the benchmark were:

    1. Distributed memory. Sends coord.table to all nodes (see the sketch after this list).
    2. Shared memory. Shares coord.table with the nodes.
    3. Shared memory with cuts. Shares a subset of coord.table with each node.
    4. Do par with cuts. Sends a subset of coord.table to each node.
    5. SNOW with cuts and progress bar. Sends a subset of coord.table to each node.
    6. Option 5 without the progress bar.
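
To make the memory difference concrete, here is a rough sketch of the option 1 style using base R's parallel package (this is not the benchmark code itself; DoWork() is a hypothetical worker function): the whole coord.table is exported to every node, whereas the "with cuts" options above only send each node the rows of its own chunk.

library(parallel)

cl <- makePSOCKcluster(kNoCores)

# option 1 style: every node receives a full copy of coord.table,
# so per-node memory use grows with the size of the whole table
clusterExport(cl, c("coord.table", "DoWork"))
res <- parLapply(cl, seq_len(kNoCores), function(i) {
  DoWork(coord.table, i)   # each worker sees the complete table
})

stopCluster(cl)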

More information about the other options I compared can be found here.

Some of these answers might suit you, although they don't relate to a distributed parLapply; I've included some of them in my benchmarking options.
