
I'm using a topic modeling approach that works well on my computer in RStudio, except that it takes ages. So I'm using a Linux cluster. However, even though I seem to request a lot of capacity, it doesn't really speed up:

I'm sorry, I'm a greenhorn... So this is what I'm using in the shell:

salloc -N 240 --mem=61440 -t 06:00:00 -p med 
#!/bin/sh

#SBATCH --nodes=200
#SBATCH --time=06:00:00
#SBATCH --partition=med
#SBATCH --mem=102400
#SBATCH --job-name=TestJobUSERNAME
#SBATCH --mail-user=username@ddomain.com
#SBATCH --mail-type=ALL
#SBATCH --cpus-per-task=100

squeue -u username
cd /work/username/data
module load R
export OMP_NUM_THREADS=100
echo "sbatch: START SLURM_JOB_ID $SLURM_JOB_ID (SLURM_TASK_PID $SLURM_TASK_PID) on $SLURMD_NODENAME" 
echo "sbatch: SLURM_JOB_NODELIST $SLURM_JOB_NODELIST" 
echo "sbatch: SLURM_JOB_ACCOUNT $SLURM_JOB_ACCOUNT"

Rscript myscript.R

I'm pretty sure there's something wrong with my inputs because:

  • it isn't really faster (though of course my R code could simply be slow, so I tried several R scripts with different kinds of calculations)
  • whether I use 1 or 200 nodes, the same R script takes almost exactly the same time to run (there should be at least 244 nodes available, though)
  • the echo results do not give complete information, and I do not receive email notifications

so these are my typical outcomes:

#just a very small request so I can copy/paste the results; usually I request the one above
[username@gw02 ~]$ salloc -N 2 --mem=512 -t 00:10:00 -p short
salloc: Granted job allocation 1234567
salloc: Waiting for resource configuration
salloc: Nodes cstd01-[218-219] are ready for job
Disk quotas for user username (uid 12345):
                 --    disk space     --
Filesystem       limit  used avail  used
/home/user         32G  432M   32G    2%
/work/user          1T  219M 1024G    0%

[username@gw02 ~]$ squeue -u username 
      JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      1234567     short     bash username  R       2:14      2 cstd01-[218-219]

#(directory, module load, etc.)

#missing outcomes for SLURM_TASK_PID and SLURMD_NODENAME:
[username@gw02 data]$ echo "sbatch: START SLURM_JOB_ID $SLURM_JOB_ID (SLURM_TASK_PID $SLURM_TASK_PID) on $SLURMD_NODENAME"
sbatch: START SLURM_JOB_ID 1314914 (SLURM_TASK_PID ) on
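A likely explanation for the empty values (an assumption, based on the `gw02` prompt looking like a login/gateway node): `SLURM_TASK_PID` and `SLURMD_NODENAME` are only set inside tasks that Slurm itself launches, e.g. via `srun` or an `sbatch` script, not in the shell that `salloc` returns on the login node. From inside the allocation, for example:

```shell
# launch one task on an allocated compute node; the Slurm task variables
# are set in that task's environment, not in the login shell
srun -N1 -n1 sh -c 'echo "SLURM_TASK_PID $SLURM_TASK_PID on $SLURMD_NODENAME"'
```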

Can anybody help? Thank you so much!

EDIT: As Ralf Stubner points out in his comment, I don't do any parallelization in the R code. I have absolutely no idea how to do that. Here is one example calculation:

# Create the data frame
col1 <- runif(12^5, 0, 2)
col2 <- rnorm(12^5, 0, 2)
col3 <- rpois(12^5, 3)
col4 <- rchisq(12^5, 2)
df <- data.frame(col1, col2, col3, col4)
# Original R code: Before vectorization and pre-allocation
system.time({
  for (i in 1:nrow(df)) { # for every row
    if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) { # check if > 4
      df[i, 5] <- "greater_than_4" # assign 5th column
    } else {
      df[i, 5] <- "lesser_than_4" # assign 5th column
    }
  }
})
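The comment above mentions vectorization; for comparison, a vectorized version of the same check (not from the original post) computes all rows at once and is typically orders of magnitude faster than the row-by-row loop:

```r
# Vectorized equivalent: sum the four columns for every row at once,
# then assign the whole fifth column (named V5, as in the loop version) in one step
row_total <- df$col1 + df$col2 + df$col3 + df$col4
df$V5 <- ifelse(row_total > 4, "greater_than_4", "lesser_than_4")
```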

... and a shortened "real code":

library(NLP)
library(tm)
library(SnowballC)
library(topicmodels)
library(lda)
library(textclean)

# load data and create corpus
filenames <- list.files(getwd(),pattern='*.txt')
files <- lapply(filenames,readLines)
docs <- Corpus(VectorSource(files))

# clean data (shortened, just two examples) 
docs.adj <- tm_map(docs, removeWords, stopwords('english'))
docs.adj <- tm_map(docs.adj, content_transformer(tolower))

# create document-term matrix
dtm <- DocumentTermMatrix(docs.adj)
dtm_stripped <- removeSparseTerms(dtm, 0.8)
rownames(dtm_stripped) <- filenames
freq <- colSums(as.matrix(dtm_stripped))
ord <- order(freq,decreasing=TRUE)

### find optimal number of k 
burnin <- 10000
iter <- 250
thin <- 50
seed <-list(3)
nstart <- 1
best <- TRUE
seq_start <- 2
seq_end <- length(files)
iteration <- floor(length(files)/5)

best.model <- lapply(seq(seq_start,seq_end, by=iteration), function(k){LDA(dtm_stripped, k, method = 'Gibbs',control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics=c(seq(seq_start,seq_end, by=iteration)), LL=as.numeric(as.matrix(best.model.logLik)))
optimal_k <- best.model.logLik.df[which.max(best.model.logLik.df$LL),]
print(optimal_k)

### do topic modeling with more iterations on optimal_k
burnin <- 4000
iter <- 1000
thin <- 100
seed <-list(2003,5,63)
nstart <- 3
best <- TRUE
ldaOut <-LDA(dtm_stripped,optimal_k, method='Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
    Without special intervention R code is single threaded. So of the 200 requested nodes, 199 would be idle. What are you doing to make the R code work in parallel? Can you show us a simplified example? C.f. [mcve] and https://cran.r-project.org/web/views/HighPerformanceComputing.html – Ralf Stubner Jun 24 '18 at 06:59
  • Hi @RalfStubner, thank you for your note. I updated the question with my R code. I didn't know that R code is single threaded. And I have no idea on how to do that. I will have a closer look at your second link, too. Unfortunately, I am new to R as well, so this looks... very very challenging ;) – GreenPirate Jun 24 '18 at 07:18

1 Answer


From a quick look at your R script, it looks like the bulk of the processing time is spent in:

best.model <- lapply(seq(seq_start,seq_end, by=iteration), function(k){
  LDA(dtm_stripped, k, method = 'Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
})

Here you could try to parallelize the code by using future_lapply() instead of lapply(), i.e.

best.model <- future_lapply(seq(seq_start,seq_end, by=iteration), function(k){
  LDA(dtm_stripped, k, method = 'Gibbs', control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
}, future.seed = TRUE)

I've also added future.seed = TRUE to make sure your random number generation is statistically sound when done in parallel. The future_lapply() function is in the future.apply package (*), so you need to add:

library(future.apply)

at the top of your script. Now there's one final thing you need to do - you need to tell it to run in parallel (the default is sequential) by adding:

plan(multiprocess)

also at the top (after attaching future.apply). The default is to use whatever cores are "available", where "available" also takes into account how many cores the HPC scheduler (e.g. Slurm) has allocated to your job. If you try the above on your local machine, it will default to the number of cores that machine has. That is, you can first verify your code on your local machine, where you should already see some speedup. Once you know it works, you can rerun it on the cluster via your Slurm allocation, and it should work there out of the box - just with more parallel processes.
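For completeness, a batch script for such a job might look like the sketch below. It requests a single node, because plan(multiprocess) parallelizes across the cores of one machine; the partition, paths and memory are copied from the question, while the core count is an assumption you should adapt:

```shell
#!/bin/sh
#SBATCH --nodes=1              # one machine; plan(multiprocess) cannot span nodes
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16     # assumed core count; the future framework picks this up
#SBATCH --time=06:00:00
#SBATCH --partition=med
#SBATCH --mem=61440
#SBATCH --job-name=TestJobUSERNAME

cd /work/username/data
module load R
Rscript myscript.R
```

Submitted with `sbatch`, this makes the 200-node request from the question unnecessary: a single R session using future_lapply() can only use the cores of the node it runs on.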

You might find my blog post on future.apply from 2018-06-23 useful - it has some FAQ at the end.

(*) Disclaimer: I'm the author of future.apply.

HenrikB
  • Hi @HenrikB, thank you for the detailed explanation. I imported it but still get the error message "Error in LDA(dtm_stripped, k, method = "Gibbs", control = list(nstart = nstart, : Each row of the input matrix needs to contain at least one non-zero entry Calls: future_lapply ... values.list -> value -> value.Future -> resignalCondition". Do you have some kind of genius sparking idea what the reason for that might be? :) – GreenPirate Jun 26 '18 at 06:59
  • From a quick look at `topicmodels::LDA()` from where that error originates, it appears to be an issue with the input data `dtm_stripped`. Is there a random component in how that is generated? If so, you'd see the same once in a while if you use plain `lapply()`. If so, set `set.seed(42)` at the top of your script, and verify that `lapply()` works. Then retry with `future_lapply()`. – HenrikB Jun 26 '18 at 16:49
  • Thank you for your great help! I experimented a bit with all the variables. One thing strikes me most: the calculations are now about 40-60% faster than running the script on my computer. However, it makes no difference whether I use "salloc -N 2 -t 04:00:00 -p med" or "salloc -N 200 -t 06:00:00 -p long". Can you explain a bit more, so that I can maybe optimize even further? I cannot stress enough how much help your answer has been so far. – GreenPirate Jul 03 '18 at 06:00
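A side note on the "at least one non-zero entry" error discussed in the comments above: after removeSparseTerms(), some documents can be left with no remaining terms, and topicmodels::LDA() rejects such all-zero rows. A common workaround (a sketch, assuming the document-term matrix from the question and using the slam package that tm's sparse matrices build on) is to drop empty rows before fitting:

```r
library(slam)

# keep only documents that still contain at least one term,
# so every row of the matrix has a non-zero entry
dtm_stripped <- dtm_stripped[row_sums(dtm_stripped) > 0, ]
```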