
I have a folder that has 5000 CSV files, each file belonging to one location and containing daily rainfall from 1980 to 2015. The sample structure of a file is as follows:

sample.file <- data.frame(location.id = rep(1001, times = 365 * 36),
                          year = rep(1980:2015, each = 365),
                          day = rep(1:365, times = 36),
                          rainfall = sample(1:100, size = 365 * 36, replace = TRUE))
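
To make the approaches below testable without the real 5000 files, a handful of files with this structure could first be written to a scratch folder (a hypothetical setup; `make.sample` and the folder name are not part of the original question):

# Hypothetical helper that builds one location's data in the structure above
make.sample <- function(location.id) {
  data.frame(location.id = rep(location.id, times = 365 * 36),
             year = rep(1980:2015, each = 365),
             day = rep(1:365, times = 36),
             rainfall = sample(1:100, size = 365 * 36, replace = TRUE))
}

# Write a few test files, one per location
dir.create("rainfall-test", showWarnings = FALSE)
for (id in 1001:1005) {
  write.csv(make.sample(id),
            file.path("rainfall-test", paste0(id, ".csv")),
            row.names = FALSE)
}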

I want to read each file, calculate the total rainfall for each year, and write the output back out. There are multiple ways I can do this:

Method 1:

library(data.table)
library(dplyr)

# names.vec holds the file names without the ".csv" extension
for(i in seq_along(names.vec)){

  name <- names.vec[i]
  dat <- fread(paste0(name, ".csv"))

  dat <- dat %>% dplyr::group_by(year) %>% dplyr::summarise(tot.rainfall = sum(rainfall))

  fwrite(dat, paste0(name, ".summary.csv"), row.names = FALSE)
}

Method 2:

my.files <- list.files(pattern = "\\.csv$")
dat <- lapply(my.files, fread)
dat <- rbindlist(dat)
dat.summary <- dat %>% dplyr::group_by(location.id, year) %>% 
               dplyr::summarise(tot.rainfall = sum(rainfall))
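
Method 2 as written produces a single combined summary table rather than one output file per location. If per-location summary files are still required, a sketch like the following could follow (it assumes `dat.summary` from above; the output file naming mirrors Method 1 and is only illustrative):

library(data.table)

# Convert the summary back to a data.table and write one file per location
dat.summary <- as.data.table(dat.summary)
for (loc in unique(dat.summary$location.id)) {
  fwrite(dat.summary[location.id == loc],
         paste0(loc, ".summary.csv"))
}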

Method 3:

I want to achieve this using `foreach`. How can I parallelise the above task using `doParallel` and the `foreach` function?

89_Simple
  • How about method4: `fread` files, `rbind` them and keep using `data.table` for performance (ie, `allFilesBinded[, sum(rainfall), .(location.id, year)]`)? btw, since [1.11.0 `fread` is parallelized](https://cran.r-project.org/web/packages/data.table/news/news.html). – pogibas Sep 13 '18 at 12:51
  • The pbapply package provides easy parallelisation – Selcuk Akbas Sep 13 '18 at 12:51
  • as easy as pblapply(my.files, fread, cl = mycl) – Selcuk Akbas Sep 13 '18 at 12:52
  • I can't test without your input, but I would go for: `library(data.table); do.call(rbind, lapply(list.files(pattern = "*.csv"), fread))[, sum(rainfall), .(location.id, year)]` – pogibas Sep 13 '18 at 13:02
  • Please clarify what packages are involved; e.g. it is unknown/ambiguous what `fread` is. – HenrikB Sep 13 '18 at 13:03
  • Learn more about parallelism with {foreach} with [this guide](https://privefl.github.io/blog/a-guide-to-parallelism-in-r/). – F. Privé Sep 13 '18 at 13:22

2 Answers


Below is a skeleton for your `foreach` request.

require(foreach)
require(iterators) # for icount()
require(doSNOW)
cl <- makeCluster(10,            # number of cores, don't use all the cores your computer has
                  type = "SOCK") # SOCK for Windows, FORK for linux
registerDoSNOW(cl)
clusterExport(cl, c("toto", "truc"), envir = environment()) # R objects needed on each core (placeholders here)
clusterEvalQ(cl, library(tcltk))                            # libraries needed on each core
my.files <- list.files(pattern = "\\.csv$")
foreach(i = icount(length(my.files)), .combine = rbind, .inorder = FALSE) %dopar% {
  # i indexes my.files
  # read csv file
  # estimate total rain
  # write output
}
stopCluster(cl)

But parallelisation really pays off only when the computation time (CPU) per independent iteration is higher than that of the remaining operations. In your case the improvement may be low, because each core needs drive access for both reading and writing, and since writing is a physical operation it can be better to do it sequentially (safer for the hardware, and possibly more efficient to have an independent location on the drive for each file rather than a shared location for multiple files, which needs indexes and so on so your OS can distinguish them -- the latter needs confirmation, it is just a thought).
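
For completeness, here is a sketch with the loop body filled in and the result captured as a return value, using `doParallel` as mentioned in the question (the core count and output file naming are illustrative assumptions, not part of the original answer):

library(foreach)
library(doParallel)
library(data.table)
library(dplyr)

cl <- makeCluster(4)   # illustrative core count
registerDoParallel(cl)

my.files <- list.files(pattern = "\\.csv$")

# Each iteration reads one file, summarises it, writes the per-location output,
# and returns the summary so foreach can rbind everything together.
dat.summary <- foreach(f = my.files, .combine = rbind,
                       .packages = c("data.table", "dplyr")) %dopar% {
  dat <- fread(f)
  out <- dat %>%
    dplyr::group_by(location.id, year) %>%
    dplyr::summarise(tot.rainfall = sum(rainfall))
  fwrite(out, sub("\\.csv$", ".summary.csv", f))
  out
}

stopCluster(cl)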

HTH

Bastien
  • The parallelization should only be done over the reading of the files. The model estimation (*) and saving to file should not be part of the parallelization. (*) Yes, _theoretically_ you might be able to do it here since it's a summation. – HenrikB Sep 13 '18 at 16:09
  • To avoid the risk of conveying that `foreach()` is a for loop rather than an "apply" function, please consider adding an explicit return value, e.g. `dat.summary <- foreach(...)`, just like `dat <- lapply(my.files, fread)`. – HenrikB Sep 13 '18 at 16:10

The pbapply package is the easiest parallelisation approach:

library(pbapply)
library(parallel)   # for makeCluster()
library(data.table) # for fread()

my.files <- list.files(pattern = "\\.csv$")
mycl <- makeCluster(4)
mylist <- pblapply(my.files, fread, cl = mycl)
stopCluster(mycl)
Selcuk Akbas
  • pbapply doesn't do any parallelization. All it does is add a progress bar. You can replace `pblapply` with `parallel::parLapply` and it will work exactly the same. – Hong Ooi Sep 13 '18 at 12:56
  • How does this answer the question if it's about adding the progress bar? – pogibas Sep 13 '18 at 12:57
  • `cl = mycl` enables parallelisation. Please try it or read the package reference. It also has a progress bar. – Selcuk Akbas Sep 13 '18 at 12:59
  • cl: A cluster object created by makeCluster, or an integer to indicate number of child-processes (integer values are ignored on Windows) for parallel evaluations. – Selcuk Akbas Sep 13 '18 at 12:59