
I'm trying to run PCA on two large datasets that are derived from the same parent dataset earlier in the script. I would like to run the PCA on each object in parallel, but I can't get it to work. The code block runs successfully and produces the expected output with a regular for loop, but each PCA takes about 1 h to run, and I'd like to take advantage of the server's capacity, as I have to do this for ~15 datasets.


This is my code:

selectObject <- function(object) {

                if(object == "scaled") {
                         scaling <<- "_scaleOnly"
                         pca.result <<- "pca.scaled"
                         object.path <<- path.scaled.object
                } 

                if(object == "scaled.regressed") {

                         scaling <<- "_scale_nUMIregress"
                         pca.result <<- "pca.scaled.regressed"
                         object.path <<- path.scaled.regressed.object

                 }
}

seurat.objects <- list(scaled=seurat.object.scale,
                  scaled.regressed=seurat.object.scale.regress
                  )


library(foreach)
library(doParallel)

cores <- detectCores()
cl <- makeCluster(2)
doParallel::registerDoParallel(cl)

foreach(object=names(seurat.objects)) %dopar% {

    print(object)

    selectObject(object)

    print(paste(object, pca.result, scaling, pca.path))


    assign(pca.result,
      doFastPCA(t(seurat.objects[[object]]@scale.data))
    )


    saveRDS(pca.result, 
            paste0("/path/to/pcaObject.", age, scaling, ".Rds")

    )

}

The above stalls forever without producing even the very first print() output, and when I cancel the process with ^C, I get the following error:

Error: "'...' used in an incorrect context"

But, if I replace the foreach line with:

for (object in names(seurat.objects)) {

[everything as above]

}

then it runs successfully, albeit sequentially.

What am I doing wrong?

Carmen Sandoval
  • Can you give an example object/data to run the function on? I have an idea what could be wrong, but want to test. EDIT: sometimes it's helpful to add the option `makeCluster(2, outfile="")` to print the output of the foreach loop to the console – Reilstein Sep 20 '18 at 19:51
  • Thanks for the tip! What's the best way to share an example? These datasets are so large... – Carmen Sandoval Sep 20 '18 at 20:19
  • Yeah I know, I work in the same field. The best way to share an example is to make a toy dataset with the same properties as your larger dataset. It doesn't have to be perfect, but it should at least have the columns and properties relevant to the question at hand. Make sure to include the code used to generate the example data within your question. – Reilstein Sep 20 '18 at 20:25
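
For anyone who wants to try the two suggestions in the comments above, here is a minimal sketch of a toy reproduction. It is not the asker's pipeline and it does not claim to explain the `'...'` error. Assumptions: `toy.objects`, `suffixes`, `make_toy` and `out.dir` are hypothetical names made up for illustration; plain random matrices stand in for the `@scale.data` slots; `prcomp()` from base R stands in for `doFastPCA()`; and the output goes to a temporary directory rather than the real path.

```r
library(foreach)
library(doParallel)

# Toy stand-in for the real data (hypothetical): small random matrices
# with genes in rows and cells in columns, like a scaled expression matrix.
set.seed(1)
make_toy <- function(n_genes = 50, n_cells = 20) {
  matrix(rnorm(n_genes * n_cells), nrow = n_genes,
         dimnames = list(paste0("gene", seq_len(n_genes)),
                         paste0("cell", seq_len(n_cells))))
}
toy.objects <- list(scaled = make_toy(), scaled.regressed = make_toy())

# Named lookup instead of setting globals with `<<-` inside a helper.
suffixes <- c(scaled = "_scaleOnly", scaled.regressed = "_scale_nUMIregress")

# Resolve the output directory on the master so every worker writes to the same place.
out.dir <- tempdir()

# outfile = "" asks the workers to send their print() output to the console.
cl <- makeCluster(2, outfile = "")
registerDoParallel(cl)

pca.results <- foreach(object = names(toy.objects)) %dopar% {
  print(object)
  pca <- prcomp(t(toy.objects[[object]]))  # stand-in for doFastPCA()
  out.file <- file.path(out.dir, paste0("pcaObject.toy", suffixes[[object]], ".Rds"))
  saveRDS(pca, out.file)                   # save the PCA object itself
  out.file                                 # each iteration returns its file path
}

stopCluster(cl)
names(pca.results) <- names(toy.objects)
```

Each iteration keeps its result in a local variable and returns the file path, so nothing depends on globals being set inside the workers. One thing that stands out in the original loop, independent of the parallel backend: `saveRDS(pca.result, ...)` would save the character string held in `pca.result`, not the object created by `assign()`.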

0 Answers