
I need to append to a CSV from a parallel loop, and I was wondering if there is any way to do that without errors.

Basically, I need to process a lot of data, and it can't all fit into memory, so I need to append the results to a file. It would take forever with a plain lapply loop, so I'm using the pbapply package. But when appending to the file, two cores will often write at the same time, which corrupts the CSV structure.

I'm assuming there's some way to lock the file while one worker is writing to it, and have the other workers wait a bit and retry until the lock is released, but I couldn't find a way to do it.

Here's an example of the type of error I'm getting:

library(parallel)
library(pbapply)
library(data.table)

write_random_thing <- function(x){
  require(data.table)

  y <- data.table(A = x, B = round(rnorm(10)*100,2))

  pth <- 'example.csv'
  fwrite(y, pth, append = TRUE)

  y
}

cl <- makeCluster(4)
xx <- pblapply(1:20, cl = cl, FUN = write_random_thing)
stopCluster(cl = cl)

yy <- rbindlist(xx)

zz <- fread('example.csv') # this will usually return an error

In this case, yy and zz should be the same (even if in a different order), but often the file can't even be read because the number of columns is not constant.

I was looking for some solution where, if the file is locked when you try to write to it, the worker sleeps for a few seconds and tries again. Does something like that exist?
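For reference, one crude way to get that sleep-and-retry behavior with base R only: `dir.create()` fails atomically if the directory already exists (on most local filesystems; NFS may not guarantee this), so a directory can serve as a lock. The `with_file_lock()` helper below is just an illustration I made up, not an existing API:

```r
# Crude mutex via dir.create(): only one process can create the lock
# directory; everyone else sleeps and retries until it is removed.
with_file_lock <- function(lockdir, expr, wait = 0.1, tries = 100) {
  for (i in seq_len(tries)) {
    if (dir.create(lockdir, showWarnings = FALSE)) {
      # Release the lock when this function exits, even on error
      on.exit(unlink(lockdir, recursive = TRUE), add = TRUE)
      return(expr)  # expr is a promise, evaluated only once locked
    }
    Sys.sleep(wait)
  }
  stop("could not acquire lock: ", lockdir)
}

# Usage inside write_random_thing():
#   with_file_lock("example.csv.lock", fwrite(y, pth, append = TRUE))
```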

Leo Barlach

2 Answers


If you need to write something in parallel, you need locks to make sure that two processes are not writing at the same time.

This is easily done in R with package {flock}:

library(parallel)
library(pbapply)
library(data.table)

write_random_thing <- function(x){
  require(data.table)

  y <- data.table(A = x, B = round(rnorm(10)*100,2))

  pth <- 'example.csv'
  lock <- flock::lock(pth)
  fwrite(y, pth, append = TRUE)
  flock::unlock(lock)

  y
}

cl <- makeCluster(4)
xx <- pblapply(1:20, cl = cl, FUN = write_random_thing)
stopCluster(cl = cl)

yy <- rbindlist(xx)

zz <- fread('example.csv') # no error now: the lock serializes the writes
F. Privé

I would do something like this to append files in parallel: each worker writes to its own file, keyed by its process ID, so no locking is needed at all.

require(doParallel)
require(doRNG)

ncores <- 7
cl <- makeCluster( ncores , outfile = "" )
registerDoParallel( cl )

res <- foreach( j = 1:100 , .verbose = TRUE , .inorder= FALSE ) %dorng%{
    d <- matrix( rnorm( 1e3 , j ) , nrow = 1 )
    conn <- file( sprintf("~/output_%d.txt" , Sys.getpid()) , open = "a" )
    write.table( d , conn , append = TRUE , col.names = FALSE )
    close( conn )
}
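Afterwards you can stack the per-worker files into a single table; a minimal sketch, assuming the `output_*.txt` glob matches the `sprintf()` pattern above (row order across workers is not guaranteed, just like with the locking approach):

```r
library(data.table)

# Collect every per-worker file and stack them into one table.
# write.table() above wrote row names, so drop the first column.
files <- Sys.glob(file.path("~", "output_*.txt"))
combined <- rbindlist(lapply(files, fread, header = FALSE))
```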
Rushabh Patel