5

I have a set of documents (~50,000) that I've transformed into a corpus, and I have been building LDA objects with the topicmodels package in R. Unfortunately, fitting a model with more than 150 topics takes several hours.

So far, I've found that I can test several different numbers of topics simultaneously using:

library(topicmodels)
library(plyr)
library(foreach)
library(doMC)
registerDoMC(5) # use 5 cores

dtm # my documenttermmatrix

k_values <- seq(200, 500, by = 50) # candidate numbers of topics (avoids masking base::seq)

models <- llply(k_values, function(k) LDA(dtm, k), .parallel = TRUE)

Is there not a way to parallelize the LDA function so that it runs faster (rather than running multiple LDAs at once)?

Optimus
  • Sorry, not clear what the question here is? – tchakravarty Jan 22 '15 at 13:29
  • How can I parallelize the LDA function in the topic models package in R (similar to what is shown in the link I posted. The discussion there only covers a Windows implementation which is quite different)? I also wonder if there are any other alternatives, specifically in R. – Optimus Jan 22 '15 at 13:45
  • That is why your question is not clear -- what is _your_ computing environment, and which one of those parallel implementations does not work in your environment? – tchakravarty Jan 22 '15 at 13:54
  • Parallelization on a real OS is much easier than on Windows. E.g., use the `foreach` alternative and read some vignettes regarding the (several) parallel backends which can be used on linux. – Roland Jan 22 '15 at 13:55
  • fg nu, I said that I'm using "AWS server (16 cores) is linux". Do you have a more specific question about my environment? Tyler, what is MWE? Roland, I've found that I can use a mix of the Plyr, foreach, and DoMC package to run the LDA function with different clusters amounts (ie, 200, 250, 300, etc) in parallel. Are there not any implementations that parallelize the LDA function itself to make it faster? – Optimus Jan 22 '15 at 14:00
  • Ok, I could not make out that the AWS server referred to was _your_ server. Most of the code listed there should work for you -- please report back with the specific parts that don't on your computing environment. – tchakravarty Jan 22 '15 at 14:03
  • Sorry, this is one of my first attempts at using Stack Overflow. I realize that it wasn't clear, and neither was my exact problem. I've added what I've tested (which works to run several clusters at the same time), but it doesn't actually speed up the LDA function. Is there any way to do that? – Optimus Jan 22 '15 at 14:11
  • @AnthonyBissell Don't think that is trivial. See the C++ implementation [here](https://code.google.com/p/plda/) and some recent literature [here](http://link.springer.com/chapter/10.1007%2F978-3-642-02158-9_26). Not sure if there is an R implementation. – tchakravarty Jan 22 '15 at 14:29
  • I don't think it's trivial. I only suspected that I might be missing an R implementation of the algorithms in the articles you just linked or that there might be something about the topicmodels package that I didn't know. – Optimus Jan 22 '15 at 14:40
  • Is there anything equivalent to the LDA function in the stm package? I've parallelized some functions in stm, but not in topicmodels. – Steve Weston Jan 23 '15 at 19:50
  • If you still need to do this you might consider using [Spark's implementation](https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda). Note however the predict function has not yet been implemented. – Chris Jul 18 '15 at 13:52

2 Answers

3

I am not familiar with the LDA function, but let's say you split the corpus into 16 pieces and put each piece in a list called corpus16list.

To run it in parallel you will usually do something like the following:

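library(doParallel) # also attaches foreach and parallel
cl <- makeCluster(16) # for 16 processors
registerDoParallel(cl)

# now start the chains
nchains <- 16
my_k <- 6 ## or a vector with 16 elements
# my_control is assumed to be a list of LDA control settings defined beforehand
results_list <- foreach(i = 1:nchains, .packages = "topicmodels") %dopar% {
  # the value of the last expression is what each iteration returns
  LDA(corpus16list[[i]], k = my_k, control = my_control)
}

stopCluster(cl) # release the worker processes when done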

The result is results_list, which is a list containing 16 outputs from your 16 chains. You can join them as you see fit, or use a .combine function in foreach (which is beyond the scope of this question).
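As a rough illustration of the .combine option mentioned above, the sketch below collects one summary row per fit instead of a list of model objects. It assumes the AssociatedPress data shipped with topicmodels as a stand-in corpus, two local workers, and two arbitrary document slices (the hypothetical `pieces` list) in place of corpus16list; none of these come from the original post.

```r
library(topicmodels)
library(doParallel)  # attaches foreach and parallel as well

data("AssociatedPress", package = "topicmodels")
# Hypothetical stand-in for corpus16list: two small slices of the example corpus
pieces <- list(AssociatedPress[1:50, ], AssociatedPress[51:100, ])

cl <- makeCluster(2)  # assumption: two local worker processes
registerDoParallel(cl)

# .combine = rbind stacks the one-row data frames returned by each iteration
summary_df <- foreach(i = seq_along(pieces), .combine = rbind,
                      .packages = "topicmodels") %dopar% {
  fit <- LDA(pieces[[i]], k = 3, control = list(seed = 1))
  data.frame(piece = i, loglik = as.numeric(logLik(fit)))
}

stopCluster(cl)
print(summary_df)
```

Returning a small summary rather than the full model keeps the data sent back from each worker small; drop .combine if you need the fitted objects themselves.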

You can use i to send different values of control, k, or whatever you need.
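One way to do that, sketched under assumptions: keep the per-task settings in a small table and index it with i. The `settings` data frame, the k and seed values, the two-worker cluster, and the AssociatedPress subset are all illustrative choices, not part of the original answer.

```r
library(topicmodels)
library(doParallel)  # attaches foreach and parallel as well

data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:100, ]  # small slice so the sketch finishes quickly

# One row of settings per task; the values are illustrative only
settings <- data.frame(k = c(2, 3, 4), seed = c(11, 22, 33))

cl <- makeCluster(2)  # assumption: two local worker processes
registerDoParallel(cl)

# Each iteration looks up its own k and seed through the index i
fits <- foreach(i = seq_len(nrow(settings)), .packages = "topicmodels") %dopar% {
  LDA(dtm_small, k = settings$k[i], control = list(seed = settings$seed[i]))
}

stopCluster(cl)

# Number of topics stored in each fitted model
fitted_k <- sapply(fits, function(m) m@k)
print(fitted_k)
```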

This code should work on both Windows and Linux, with however many cores you need.

dwcoder
  • OP already has code to run LDA independently in parallel, he is looking AFAICT for a parallel implementation of the algorithm itself. – tchakravarty Jan 22 '15 at 15:42
  • fg nu is correct, but thanks for the alternative script. – Optimus Jan 22 '15 at 15:49
  • Thanks, I wrote this script before OP edited the original question. I didn't know he already had parallel script. – dwcoder Jan 22 '15 at 16:02
  • I think it's certainly helpful to have all of the possible options for this problem in one place. There aren't a lot of results that come up when you search for this. – Optimus Jan 22 '15 at 16:13
0

I don't think you can parallelize the LDA model itself: fitting it is an iterative maximum-likelihood optimization, so each iteration needs the likelihood from the previous one before it can proceed.

Maelba