I tried using the tidytext package to do topic modeling on bigrams, following the steps on the tidytext website: https://www.tidytextmining.com/ngrams.html.

I was able to get to the "word_counts" part, where R calculates each bigram's frequency.

"word_counts" returned the following:

   customer_id       word          n
   <chr>            <chr>        <int>
 1 00000001234  sample text        45
 2 00000002345  good morning       30
 3 00000003456  happy friday       24

The next step was to cast the counts above into a document-term matrix (DTM). My code is below:

lda_dtm <- word_counts %>%
  cast_dtm(customer_id, word, n)

A warning message was raised:

Warning message:
Trying to compute distinct() for variables not found in the data:
- `row_col`, `column_col`
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged. 

But "lda_dtm" looks like it's in the right format.

lda_dtm
<<DocumentTermMatrix (documents: 9517, terms: 341545)>>
Non-/sparse entries: 773250/3249710515
Sparsity           : 100%
Maximal term length: NA
Weighting          : term frequency (tf)

However, when I tried to run LDA, it did not work.

burnin <- 4000
iter <- 300
thin <- 500
seed <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best <- TRUE
k <- 6

out_LDA <- LDA(lda_dtm,
               k = k,
               method = "Gibbs",
               control = list(nstart = nstart,
                              seed = seed,
                              best = best,
                              burnin = burnin,
                              iter = iter,
                              thin = thin))

The following error was raised:

Error in seq.default(CONTROL_i@iter, control@burnin + control@iter, by = control@thin) : 
  wrong sign in 'by' argument

I don't see a topic modeling tutorial for bigrams on the tidytext website; the tutorial there was specifically for unigrams. How should I adjust the format for it to work with bigrams?

1 Answer


1: The message you get from cast_dtm actually comes from cast_sparse. Two issues on GitHub, #120 and #121, deal with this. At the moment this is fixed in the GitHub version of the package, but the fix has not been released to CRAN yet.

If you want to, you can install it from github with devtools::install_github("juliasilge/tidytext").

2: The error you get from LDA has nothing to do with point 1. If you just run out_LDA <- LDA(lda_dtm, k = k), LDA will run just fine. The problem lies in your control option thin, which should be less than or equal to the iter parameter. In your case thin is set to 500 while iter is 300, hence the error. You can see the error appear as soon as thin is 1 higher than iter.
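To see why thin > iter trips seq(), here is a minimal base R sketch. It assumes that the first argument in the error message, CONTROL_i@iter, holds burnin + thin; that is my reading of the topicmodels internals, not something the error message itself confirms. The variable names from, to, thin_ok and ok are only for illustration.

```r
# Values from the question
burnin <- 4000
iter   <- 300
thin   <- 500

# Assumption: the sampler's sequence runs from burnin + thin
# to burnin + iter in steps of thin.
from <- burnin + thin   # 4500
to   <- burnin + iter   # 4300

# from > to with a positive step, so seq() refuses to build the sequence
result <- tryCatch(
  seq(from, to, by = thin),
  error = function(e) conditionMessage(e)
)
print(result)

# With thin <= iter the same call produces a valid sequence
thin_ok <- 100
ok <- seq(burnin + thin_ok, burnin + iter, by = thin_ok)
print(ok)
```

In your control list, lowering thin to 300 or less (or raising iter to at least 500) should therefore be enough to get past this error.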

phiver