I am trying to use tidytext to transform a tibble of word frequencies into a DocumentTermMatrix, but the function doesn't seem to work as expected. I start from AssociatedPress, which I know is a DocumentTermMatrix, tidy it, and cast it back, but the output is not the same as the original matrix. What am I doing wrong?
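library(topicmodels)  # provides the AssociatedPress data
library(tidytext)     # tidy() and cast_dtm()
library(dplyr)        # %>%
data(AssociatedPress)
# Tidy the DTM into one row per document-term pair, then cast it back
ap_td <- tidy(AssociatedPress)
tt <- ap_td %>%
  cast_dtm(document, term, count)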
The element $Docs is non-NULL when I cast ap_td, but it was NULL in AssociatedPress, and the terms come back in a different order:
str(tt)
List of 6
$ i : int [1:302031] 1 16 35 72 84 93 101 111 155 161 ...
$ j : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : chr [1:2246] "1" "2" "3" "4" ...
..$ Terms: chr [1:10473] "adding" "adult" "ago" "alcohol" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : num [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
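To check whether the two objects differ only in naming and ordering, or also in the actual counts, I put together the comparison below. It is only a sketch: the names orig, back, same_shape, same_terms and same_counts are my own, and I am assuming that tidy() on both objects returns the columns document, term and count.

```r
library(topicmodels)
library(tidytext)
library(dplyr)

data(AssociatedPress)
tt <- tidy(AssociatedPress) %>%
  cast_dtm(document, term, count)

# Same number of documents and terms?
same_shape <- identical(
  c(AssociatedPress$nrow, AssociatedPress$ncol),
  c(tt$nrow, tt$ncol)
)

# Same vocabulary, ignoring the order of the terms?
same_terms <- setequal(AssociatedPress$dimnames$Terms, tt$dimnames$Terms)

# Same counts once both objects are tidied and sorted the same way?
# document is coerced to character because one matrix has document
# names and the other only has row positions.
orig <- tidy(AssociatedPress) %>%
  mutate(document = as.character(document)) %>%
  arrange(document, term)
back <- tidy(tt) %>%
  mutate(document = as.character(document)) %>%
  arrange(document, term)
same_counts <- isTRUE(all.equal(as.data.frame(orig), as.data.frame(back)))

c(shape = same_shape, terms = same_terms, counts = same_counts)
```

If all three values are TRUE, the difference would be limited to the added document names and the order in which the terms are stored.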
cast_dtm also raises a warning:
Warning message:
Trying to compute distinct() for variables not found in the data:
- `row_col`, `column_col`
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.
On GitHub, I found this issue, which should have been fixed by now.