
The following is my code. I am trying to get a list of all the files (~20,000) that end with .idat and read each file with the function illuminaio::readIDAT.

library(illuminaio)
library(parallel)
library(data.table)

# number of cores to use
ncores = 8

# get all files with the .idat extension (~20000 files);
# list.files() expects a regex, so anchor on the extension
files <- list.files(path = './',
                    pattern = "\\.idat$",
                    full.names = TRUE)

# read one idat file, build a one-row data.table (filename, nSNPs, ChipType)
# and append it to a CSV with fwrite
get.chiptype <- function(x) {
  idat <- readIDAT(x)
  res <- data.table(filename = x, nSNPs = nrow(idat$Quants), Chip = idat$ChipType)
  fwrite(res, file = 'output.csv', append = TRUE)
}

# use mclapply to call get.chiptype on all ~20000 files,
# 8 cores at a time
mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores)

After reading and writing info about 1200 files, I get the following message:

Warning message:
In mclapply(files, FUN = function(x) get.chiptype(x), mc.cores = ncores) :
  all scheduled cores encountered errors in user code

How do I resolve it?

  • what are destdir and destfile – rawr Sep 01 '16 at 20:46
  • They are just directory and filename where the data.table will be written. I will remove that. – Komal Rathi Sep 01 '16 at 20:49
  • and you still get errors? – rawr Sep 01 '16 at 20:53
  • This may not be the problem, but I'd be wary of appending to a single file from parallel processes. I'm not an expert, but that seems a recipe for trouble. Do you know if they lock the file somehow so they can only write one at a time? – Aaron left Stack Overflow Sep 01 '16 at 20:54
  • That is not the problem. Those were just paths to my source and destination files. This is the problem with parallel processing. – Komal Rathi Sep 01 '16 at 20:56
  • I know for sure that I provided 8 cores and it was writing 8 lines simultaneously. It worked for the first 1200 or so files and then it threw an error. It could also be a problem with reading in the idat files. I will check and get back. – Komal Rathi Sep 01 '16 at 20:57
  • Perhaps it could have not tried to write simultaneously until the 1201st file. Again, I'm not saying this is the problem, only that parallel processes are notorious for working sometimes and not others due to race conditions like this. – Aaron left Stack Overflow Sep 01 '16 at 21:03
  • Would a tryCatch block help? – Sambit Tripathy Aug 10 '17 at 20:27
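
Picking up on the comments about concurrent appends and tryCatch: one possible pattern (a sketch, not from the original post) is to avoid writing from inside the workers at all. Each worker wraps readIDAT() in tryCatch() and returns a one-row data.table (or the error message), and the parent process combines everything and writes a single CSV. The output name chiptypes.csv and the error column are illustrative choices, not part of the original code.

library(illuminaio)
library(parallel)
library(data.table)

ncores <- 8
files <- list.files(path = './', pattern = "\\.idat$", full.names = TRUE)

# each worker only reads; failures are captured instead of killing the worker
get.chiptype <- function(x) {
  tryCatch({
    idat <- readIDAT(x)
    data.table(filename = x, nSNPs = nrow(idat$Quants),
               Chip = idat$ChipType, error = NA_character_)
  }, error = function(e) {
    data.table(filename = x, nSNPs = NA_integer_,
               Chip = NA_character_, error = conditionMessage(e))
  })
}

res <- mclapply(files, get.chiptype, mc.cores = ncores)

# the parent process does the single write, so no two processes
# ever append to the same file at the same time
fwrite(rbindlist(res), 'chiptypes.csv')

Any file that failed then shows up with its message in the error column, which makes the underlying error visible instead of the generic warning from mclapply().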

1 Answer


In some cases, calling mclapply() requires you to specify a random number generator that supports multiple independent streams. Since version 2.14.0, R has included an implementation of Pierre L'Ecuyer's combined multiple recursive generator.

Try adding the following before the mclapply() call, with a pre-specified value for 'my.seed':

set.seed(my.seed, kind = "L'Ecuyer-CMRG")
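
For context, a minimal sketch of where that call would sit, reusing files, get.chiptype, and ncores from the question (my.seed = 123 is just a placeholder value). It also shows one way to see which elements failed, since mclapply() returns objects of class "try-error" for workers that hit an error:

my.seed <- 123  # placeholder; any fixed integer
set.seed(my.seed, kind = "L'Ecuyer-CMRG")

res <- mclapply(files, get.chiptype, mc.cores = ncores)

# elements whose worker errored come back as "try-error" objects;
# printing them shows the actual error message behind the warning
failed <- vapply(res, inherits, logical(1), what = "try-error")
files[failed]
res[failed]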