I need to append to a CSV file from a parallel loop, and I was wondering if there is any way to do that without errors.
Basically, I need to process a lot of data that doesn't all fit into memory, so I have to append the results to a file as I go. It would take forever with a plain lapply
loop, so I'm using the pbapply
package. But when the results are appended, two workers will often write to the file at the same time, which corrupts the CSV structure.
I'm assuming there is some way to lock the connection to the file while one worker is writing to it, and have the other workers wait a bit and retry once the connection is released, but I couldn't find a way to do it.
Here's an example of the type of error I'm getting:
library(parallel)
library(pbapply)
library(data.table)
write_random_thing <- function(x){
  require(data.table)
  y <- data.table(A = x, B = round(rnorm(10) * 100, 2))
  pth <- 'example.csv'
  fwrite(y, pth, append = TRUE)  # several workers may append here at the same time
  y
}
cl <- makeCluster(4)
xx <- pblapply(1:20, cl = cl, FUN = write_random_thing)
stopCluster(cl = cl)
yy <- rbindlist(xx)
zz <- fread('example.csv') # this will usually return an error
In this case, yy
and zz
should contain the same rows (possibly in a different order), but often the file can't even be read because the number of columns per row is not constant.
I was looking for some solution in which, if the file is locked when you try to write to it, the worker sleeps for a few seconds and then tries again. Does something like that exist?
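Something along these lines is what I have in mind. This is only an untested sketch using the filelock package (the example.csv.lock companion file is just my assumption for where the lock would live), so I don't know if it's the right approach:

library(filelock)

write_random_thing <- function(x){
  require(data.table)
  require(filelock)
  y <- data.table(A = x, B = round(rnorm(10) * 100, 2))
  pth <- 'example.csv'
  # take an exclusive lock on a companion lock file before appending;
  # lock() blocks until the lock is free or the timeout (in ms) expires
  l <- lock(paste0(pth, '.lock'), timeout = 60000)
  if (is.null(l)) stop('could not acquire the lock')
  fwrite(y, pth, append = TRUE)
  unlock(l)
  y
}

If that worked, each worker would just block briefly while another one is appending instead of corrupting the file, but I don't know whether filelock is the right tool here or whether data.table/parallel already provide something for this.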