Taking a latent semantic analysis (lsa) object and scoring on new data in R

Question

I am running latent semantic analysis (LSA) using textmineR in R. What I want is the document-by-topic matrix of topic scores for each document, which I can get by calling theta from my LSA object (below). However, I am having trouble using that fitted LSA object to score a new dataset (i.e. a document term matrix, dtm) so that I can apply my existing topic structure to new data. In the example below, I create two topics, and when I then pass the exact same dtm to predict (pretending it is a new file for the sake of this example), I get the following error:

"Error in predict.lsa_topic_model(model, dtm_m) : newdata must be a matrix of class dgCMatrix or a numeric vector"

I need to use an LSA object to score new text. Is there an easy fix that I'm missing? I haven't had any luck coercing the matrix to a "dgCMatrix". I also don't know how to do this with other packages such as lsa. Any help with this approach would be greatly appreciated.

file = as.data.frame(matrix( c('case1', 'this is some SAMPLE TEXT!',
'case2',  'and this is the 2nd version of that text...', 
'case3', 'more stuff to talk about'), 
        nrow=3,              
        ncol=2,              
        byrow = TRUE))
names(file) [1] <- 'doc_id'
names(file) [2] <- 'text'

library(tm)
library(RWeka)  # provides NGramTokenizer() and Weka_control() used below
wordCorpus <- Corpus(DataframeSource(file))

cleaner <- function (wordCorpus) {
  wordCorpus <- tm_map(wordCorpus, removeNumbers)
  wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
  wordCorpus <- tm_map(wordCorpus, removePunctuation)
  return (wordCorpus)
}
wordCorpus <- cleaner (wordCorpus)

tokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 1, max = 2))
dtm  <- DocumentTermMatrix (wordCorpus, control = list (tokenize=tokenizer, weighting = weightTfIdf))
dtm_m <- as.matrix(dtm)

library(textmineR)
model <- FitLsaModel(dtm = dtm_m,  k = 2)

#this is what I want to get, but ideally also 
#be able to save the "model" object and use to create this in a new sample

values <- as.data.frame (model$theta)
values
#pretending my original dataset is a new sample and using predict
values_other <- predict (model, dtm_m)

You can create a dgCMatrix with the Matrix package. Either with `Matrix::sparseMatrix()` or with `Matrix::Matrix(x, sparse= TRUE)`. The first function is the recommended way. — phiver, Feb 08 '19 at 17:25
Great, that answered it. Out of curiosity, why is the first function preferred? I ran the second, but the first requires outlining various inputs from the original matrix. — Drew, Feb 08 '19 at 20:05
The first option gives you more control, but is also a lot more efficient in memory consumption. For small matrices it doesn't matter, but for a bit larger ones, you can run into problems. — phiver, Feb 09 '19 at 14:26
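To make the two options from the comments concrete, here is a minimal sketch using only the Matrix package. The small `dense` matrix is a made-up stand-in for `as.matrix(dtm)`, not the data from the question. The commented-out lines at the end assume the tm `dtm` object from the question, which stores its entries in triplet form (`i`, `j`, `v`), so the sparse matrix can be built without ever creating the dense copy.

```r
library(Matrix)

# Hypothetical stand-in for as.matrix(dtm): 3 documents x 4 terms
dense <- matrix(c(1, 0, 0, 2,
                  0, 3, 0, 0,
                  0, 0, 4, 5),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("case1", "case2", "case3"),
                                c("a", "b", "c", "d")))

# Option 2 from the comments: coerce an existing dense matrix
sp_coerced <- Matrix(dense, sparse = TRUE)

# Option 1 from the comments: build the sparse matrix from the
# positions and values of the non-zero entries
idx <- which(dense != 0, arr.ind = TRUE)
sp_built <- sparseMatrix(i = idx[, "row"], j = idx[, "col"],
                         x = dense[idx],
                         dims = dim(dense),
                         dimnames = dimnames(dense))

class(sp_coerced)
class(sp_built)

# With the tm object from the question (assumption: `dtm` is the
# DocumentTermMatrix created there), the dense step can be skipped:
# dtm_sparse <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
#                            dims = dim(dtm), dimnames = dimnames(dtm))
```

Either result can then be passed to `predict(model, ...)` in place of the dense `dtm_m`.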

score 0 · Answer 1 · answered Dec 28 '19 at 23:57

For workflows like this, you can pretty safely skip using tm altogether and just use textmineR's CreateDtm function directly.

See the LSA example as part of textmineR's vignette, which shows this exact workflow. https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html
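As a rough sketch of what that looks like for the example in the question (not a copy of the vignette): `CreateDtm` returns a sparse `dgCMatrix` directly, so the result can be passed to both `FitLsaModel` and `predict` without any coercion. For brevity this sketch fits on raw counts rather than TF-IDF weights, and it turns off stopword removal because the sample texts are so short; both are assumptions made for illustration.

```r
library(textmineR)

# The three sample documents from the question, as a named character vector
docs <- c(case1 = "this is some SAMPLE TEXT!",
          case2 = "and this is the 2nd version of that text...",
          case3 = "more stuff to talk about")

# Unigrams and bigrams, lowercased, punctuation and numbers removed.
# stopword_vec = c() keeps all words (assumption made for this tiny example).
dtm <- CreateDtm(doc_vec = docs,
                 doc_names = names(docs),
                 ngram_window = c(1, 2),
                 stopword_vec = c(),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE,
                 verbose = FALSE)

# Fit a two-topic LSA model on the sparse matrix
model <- FitLsaModel(dtm = dtm, k = 2)

# Document-by-topic scores for the training documents
values <- as.data.frame(model$theta)

# Score "new" documents (here the same dtm, as in the question)
values_other <- predict(model, dtm)
dim(values_other)
```

New text should be run through `CreateDtm` with the same settings before calling `predict`, so that its terms line up with the vocabulary the model was fitted on. The fitted `model` can be stored with `saveRDS()` and reloaded with `readRDS()` for later scoring.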
