
I am running a latent Dirichlet allocation (LDA) topic model in R using the following code:

for (k in 2:30) {
    ldaOut <- LDA(dtm, k, method = "Gibbs",
                  control = list(nstart = nstart, seed = seed, best = best,
                                 burnin = burnin, iter = iter, thin = thin))
    assign(paste("ldaOut", k, sep = "_"), ldaOut)
}

The dtm has 12 million elements, and each iteration of the loop takes around two hours. Meanwhile, R uses only 1 of my 8 logical processors (I have an i7-2700K CPU @ 3.50GHz with 4 cores). How can I make R use all the computational power available when I run one LDA topic model, or when using a loop (as in this code)?

Thank you

EDIT: following gc_'s advice, I used the following code:

library(doParallel)

n.cores <- detectCores(all.tests = T, logical = T)
cl <- makePSOCKcluster(n.cores)

doParallel::registerDoParallel(cl)

burnin <- 4000 
iter <- 2000
thin <- 500 
seed <-list(2003,10,100,10005,765)
nstart <- 5 
best <- TRUE 

var.shared <- c("ldaOut", "dtm", "nstart", "seed", "best", "burnin", "iter", "thin", "n.cores")
library.shared <- "topicmodels" # Same for library or functions.


ldaOut <- c()

    foreach (k = 2:(30 / n.cores - 1), .export = var.shared, .packages = library.shared) %dopar% {
        ret <- LDA(dtm, k*n.cores , method="Gibbs", 
                   control=list(nstart=nstart, seed = seed, best=best, 
                                burnin = burnin, iter = iter, thin=thin))
        assign(paste("ldaOut", k*n.cores, sep = "_"), ret)
    }

The code ran without errors, but now there are 16 "R for Windows front-end" processes, 15 of which use 0% of the CPU and one uses 16-17%. And when the process was over, I got this message:

A LDA_Gibbs topic model with 16 topics.

    Warning messages:
    1: In e$fun(obj, substitute(ex), parent.frame(), e$data) :
      already exporting variable(s): dtm, nstart, seed, best, burnin, iter, thin, n.cores
    2: closing unused connection 10 (<-MyPC:11888) 
    3: closing unused connection 9 (<-MyPC:11888) 
    4: closing unused connection 8 (<-MyPC:11888) 
    5: closing unused connection 7 (<-MyPC:11888) 
    6: closing unused connection 6 (<-MyPC:11888) 
    7: closing unused connection 5 (<-MyPC:11888) 
    8: closing unused connection 4 (<-MyPC:11888) 
    9: closing unused connection 3 (<-MyPC:11888) 
Michael
  • https://cran.r-project.org/web/views/HighPerformanceComputing.html – r2evans Nov 08 '18 at 02:30
  • Have you tried the text2vec package for topic models? It is faster. Please see these links: http://text2vec.org/topic_modeling.html and https://stackoverflow.com/questions/52268925/lda-topic-model-using-r-text2vec-package-and-ldavis-in-shinyapp – Sam S. Nov 13 '18 at 23:24

2 Answers

2

You can use the doParallel library:

library(doParallel)

To get the number of cores of your computer:

n.cores <- detectCores(all.tests = T, logical = T) 

Note that detectCores lets you distinguish between logical and physical cores via its logical argument.
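For reference, a quick way to see both counts (the 4-core / 8-thread split matches the asker's i7-2700K):

```r
library(parallel)

# Logical processors include hyper-threads; physical cores do not
detectCores(logical = TRUE)   # e.g. 8 on a 4-core CPU with hyper-threading
detectCores(logical = FALSE)  # e.g. 4
```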

Now you need to create the cluster and register the parallel backend:

cl <- makePSOCKcluster(n.cores) 
doParallel::registerDoParallel(cl)

You can create more processes than you have cores on your computer. Since R spawns new processes, you need to specify the libraries and variables you want to share with the workers.

var.shared <- c("ldaOut", "dtm", "nstart", "seed", "best", "burnin", "iter", "thin", "n.cores")
library.shared <- c() # Same for library or functions.

Then the loop will change to:

 ldaOut <- c()  # initialise the output

 foreach (k = 2:(30 / n.cores - 1), .export = var.shared, .packages = library.shared) %dopar% {
      ret <- LDA(dtm, k*n.cores , method="Gibbs", 
                     control=list(nstart=nstart, seed = seed, best=best, 
                                  burnin = burnin, iter = iter, thin=thin))
      assign(paste("ldaOut", k*n.cores, sep = "_"), ret)
}

I have never used LDA before, so you might need to modify the code above a bit in order to make it work.
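One caveat with the sketch above: assign() inside %dopar% only writes into each worker's own environment, so the fitted models never reach the master session. A simpler pattern, sketched here under the assumption that dtm and the control variables from the question are already defined, is to let a parallel apply return the models as a list. The base parallel package (shipped with R) can do this without foreach:

```r
library(parallel)

cl <- makeCluster(detectCores(logical = TRUE))
# Ship the data and sampler settings to the workers, and load topicmodels there
clusterExport(cl, c("dtm", "nstart", "seed", "best", "burnin", "iter", "thin"))
clusterEvalQ(cl, library(topicmodels))

# Fit one model per k; parLapply spreads the 29 fits over the workers
ldaOut <- parLapply(cl, 2:30, function(k)
  LDA(dtm, k, method = "Gibbs",
      control = list(nstart = nstart, seed = seed, best = best,
                     burnin = burnin, iter = iter, thin = thin)))
names(ldaOut) <- paste("ldaOut", 2:30, sep = "_")

stopCluster(cl)
```

Each worker then runs a whole Gibbs chain for one value of k, which is why this parallelises well even though a single chain cannot be split.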

gc_
  • thank you! I've tried this code, but got an error as shown above. It seems like the loop can't find the LDA function. Am I doing something wrong? – Michael Nov 08 '18 at 04:08
  • Have you specified the library of the LDA function in library.shared? – gc_ Nov 08 '18 at 04:11
  • ok, I think I didn't, so now I changed from "ldaOut <- #Init the output#" to "library.shared <- LDA", did I understand you correctly? With this line I get the error "Error in foreach(k = 2:(30/n.cores - 1), .export = var.shared, .packages = library.shared) : .packages must be a character vector" – Michael Nov 08 '18 at 04:35
  • if the package name is LDA then library.shared <- "LDA". For ldaOut you need to initialize this variable with the object (but empty) as return the function LDA. – gc_ Nov 08 '18 at 06:07
  • gc_, the package name is "topicmodels", so I did the following: "library.shared <- "topicmodels"" and "ldaOut <- c()". With these modifications, I've been able to run the code without any errors, but R is using only 15-17% of my CPU. There are 15 "R for Windows front-end" processes using 0% and 1 using 16%. – Michael Nov 08 '18 at 17:11
1

I think LDA is hard to parallelise, since each Gibbs sweep uses the result of the previous sweep.

So to speed things up, you could, imo:

- reduce your dtm
- use faster libraries e.g. vowpal wabbit 
- use faster hardware e.g. aws

If you optimize for "hyperparameters" like alpha, eta, burnin, etc., you could run the full LDA with different hyperparameters on each core.
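That idea can be sketched with the base parallel package. The alpha grid and k = 10 below are hypothetical choices, and dtm plus the sampler settings are assumed to exist as in the question:

```r
library(parallel)

alphas <- c(0.1, 0.5, 1, 5)        # hypothetical grid of alpha values
cl <- makeCluster(length(alphas))  # one full Gibbs run per worker
clusterExport(cl, c("dtm", "nstart", "seed", "burnin", "iter", "thin"))
clusterEvalQ(cl, library(topicmodels))

fits <- parLapply(cl, alphas, function(a)
  LDA(dtm, k = 10, method = "Gibbs",
      control = list(alpha = a, nstart = nstart, seed = seed,
                     burnin = burnin, iter = iter, thin = thin)))
stopCluster(cl)

# Compare the candidates, e.g. by log-likelihood
sapply(fits, logLik)
```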

Karsten W.