I tried using the tidytext package to do topic modeling on bigrams, following the steps on the tidytext website: https://www.tidytextmining.com/ngrams.html.
I was able to get to the "word_counts" part, where R calculates each bigram's frequency.
"word_counts" returned the following:
  customer_id word             n
  <chr>       <chr>        <int>
1 00000001234 sample text     45
2 00000002345 good morning    30
3 00000003456 happy friday    24
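For context, a pipeline along the following lines produces a table of that shape. This is only a minimal sketch modeled on the tidytext n-gram chapter, not necessarily the exact code that was run: the `docs` tibble, its `text` column, and the two toy rows are hypothetical placeholders.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy input: one row per customer with a free-text column.
docs <- tibble(
  customer_id = c("00000001234", "00000002345"),
  text = c("sample text sample text", "good morning good morning")
)

# Split each text into bigrams, then count bigrams per customer.
# The output column is called `word` to match the table above,
# even though each value is a two-word bigram.
word_counts <- docs %>%
  unnest_tokens(word, text, token = "ngrams", n = 2) %>%
  count(customer_id, word, sort = TRUE)

print(word_counts)
```

Each row is one (customer, bigram) pair with its count, which is the same one-term-per-row layout that cast_dtm() takes for unigrams.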
The next step was to cast the counts above into a document-term matrix (DTM).
My code is below:
lda_dtm <- word_counts %>%
  cast_dtm(customer_id, word, n)
A warning message was raised:
Warning message:
Trying to compute distinct() for variables not found in the data:
- `row_col`, `column_col`
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.
But "lda_dtm" looks like it's in the right format.
lda_dtm
<<DocumentTermMatrix (documents: 9517, terms: 341545)>>
Non-/sparse entries: 773250/3249710515
Sparsity : 100%
Maximal term length: NA
Weighting : term frequency (tf)
However, when I tried to run LDA, it failed.
burnin <- 4000
iter <- 300
thin <- 500
seed <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best <- TRUE
k <- 6
out_LDA <- LDA(lda_dtm,
               k = k,
               method = "Gibbs",
               control = list(nstart = nstart,
                              seed = seed,
                              best = best,
                              burnin = burnin,
                              iter = iter,
                              thin = thin))
The following error was raised:
Error in seq.default(CONTROL_i@iter, control@burnin + control@iter, by = control@thin) :
wrong sign in 'by' argument
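The same message can be reproduced with base R's seq() alone, which suggests the control values rather than the bigram format may be involved. The sketch below reuses the burnin, iter, and thin values from the call above. That the sampler's sequence starts at burnin + thin is an assumption inferred from the error text, not something confirmed from the topicmodels source.

```r
# Values taken from the LDA() control list above.
burnin <- 4000
iter <- 300
thin <- 500

# Assumption (inferred from the error message, not verified in topicmodels):
# the sequence runs from burnin + thin up to burnin + iter.
from <- burnin + thin  # 4500
to <- burnin + iter    # 4300

# seq() cannot step upwards by +500 from 4500 to 4300, so it stops with an error.
msg <- tryCatch(
  {
    seq(from, to, by = thin)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With a step no larger than iter, the same kind of call succeeds.
ok <- seq(burnin + 100, burnin + iter, by = 100)
print(ok)
```

If that assumption holds, any thin larger than iter would trigger the error regardless of whether the terms are unigrams or bigrams.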
I don't see a topic modeling tutorial for bigrams on the tidytext website; the tutorial covers only unigrams. How should I adjust the format so that it works with bigrams?