
I am trying to use tidytext to turn a tibble of word frequencies into a DocumentTermMatrix, but the function doesn't seem to work as expected. I start from AssociatedPress, which I know is a DocumentTermMatrix, tidy it, and cast it back, but the output is not the same as the original matrix. What am I doing wrong?

library(topicmodels)
library(tidytext)  # tidy() and cast_dtm()
library(dplyr)     # %>%

data(AssociatedPress)
ap_td <- tidy(AssociatedPress)
tt <- ap_td %>%
  cast_dtm(document, term, count)

The element $dimnames$Docs is non-NULL after I cast ap_td, but it was NULL in AssociatedPress. Output of str(tt):

List of 6
 $ i       : int [1:302031] 1 16 35 72 84 93 101 111 155 161 ...
 $ j       : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2246
 $ ncol    : int 10473
 $ dimnames:List of 2
  ..$ Docs : chr [1:2246] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:10473] "adding" "adult" "ago" "alcohol" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

Output of str(AssociatedPress):

List of 6
 $ i       : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
 $ v       : num [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
 $ nrow    : int 2246
 $ ncol    : int 10473
 $ dimnames:List of 2
  ..$ Docs : NULL
  ..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

cast_dtm also raises a warning:

Warning message:
Trying to compute distinct() for variables not found in the data:
- row_col, column_col
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.

On GitHub, I found this issue, which should have been fixed by now.

pogibas
Dambo

1 Answer

I don't get your warning message using tidytext 0.1.9.900 and R 3.5.0.

The two DTMs have the same number of rows, columns, and terms, and all the counts match.

The difference is indeed between tt$dimnames$Docs and AssociatedPress$dimnames$Docs.

The reason is that when a DTM has no document IDs before tidying, as is the case with AssociatedPress, the tidy function assigns AssociatedPress$i (the row index) to the document variable of the tidy data (ap_td). Casting this back into a DTM fills $dimnames$Docs with the document values from ap_td. So in the end the AssociatedPress$i values end up in tt$dimnames$Docs.

You can see this by comparing $i from AssociatedPress with the Docs from tt.

all.equal(unique(as.character(AssociatedPress$i)), unique(tt$dimnames$Docs))
[1] TRUE

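The same holds one step earlier, between AssociatedPress and ap_td. Note that all.equal compares only its first two arguments, so the check has to be done pairwise:

all.equal(unique(as.character(AssociatedPress$i)), unique(as.character(ap_td$document)))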
[1] TRUE
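
To check that only the document labels differ and not the contents, one option is to tidy the cast matrix again and compare it with ap_td entry by entry. This is a minimal sketch, not the only way to do it. The helper names (orig_has_docs, cast_has_docs, original, roundtrip, same_entries) are made up for illustration, and it assumes the packages from the question are installed.

```r
library(topicmodels)  # ships the AssociatedPress DocumentTermMatrix
library(tidytext)     # tidy() and cast_dtm()
library(dplyr)        # %>%, mutate(), arrange()

data("AssociatedPress", package = "topicmodels")

ap_td <- tidy(AssociatedPress)
tt <- ap_td %>% cast_dtm(document, term, count)

# Does each matrix carry document names?
orig_has_docs <- !is.null(AssociatedPress$dimnames$Docs)
cast_has_docs <- !is.null(tt$dimnames$Docs)

# Bring both tidy tables into the same column types and row order,
# so that only the actual (document, term, count) entries are compared.
original <- ap_td %>%
  mutate(document = as.integer(document)) %>%
  arrange(document, term)
roundtrip <- tidy(tt) %>%
  mutate(document = as.integer(document)) %>%
  arrange(document, term)

same_entries <- identical(original$document, roundtrip$document) &&
  identical(original$term, roundtrip$term) &&
  isTRUE(all.equal(original$count, roundtrip$count))

print(c(orig_has_docs = orig_has_docs,
        cast_has_docs = cast_has_docs,
        same_entries  = same_entries))
```

Comparing the tidy form sidesteps the different term ordering and the different ordering of $i and $j in the two objects, which is why a direct identical(tt, AssociatedPress) is not a useful test here.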

If you want to follow the logic yourself, you can check the functions used in the sparse_tidiers source on the tidytext GitHub page. Start with tidy.DocumentTermMatrix and follow the function calls to tidy.simple_triplet_matrix and finally to tidy_triplet.

phiver