Using tidyr and tidytext, you first unnest the list column and then replace the whitespace with _ (you can use something else, but _ is commonly used to join the words of an n-gram). This way the words are not separated when producing the term-document matrix (tdm):
library(dplyr)
library(tidyr)
library(tidytext)
library(stringr)
# bring toy data into useful form
df <- tibble::tribble(
  ~doc_id, ~text,
  1, c('cat dog', 'cat rat'),
  2, c('cat dog'),
  3, c('cat rat')
)
tdm <- df %>%
  unnest(text) %>%                                        # one row per phrase
  mutate(text = str_replace_all(text, "\\s+", "_")) %>%   # replace every whitespace run
  unnest_tokens(word, text) %>%                           # "cat_dog" stays one token
  count(word, doc_id) %>%
  cast_tdm(word, doc_id, n)
tdm
#> <<TermDocumentMatrix (terms: 2, documents: 3)>>
#> Non-/sparse entries: 4/2
#> Sparsity           : 33%
#> Maximal term length: 7
#> Weighting          : term frequency (tf)
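To see why the underscore matters, here is a minimal sketch that tokenizes the same phrase with and without it. The phrases tibble and its id column are made up for illustration and are not part of the data above; the point is only that unnest_tokens() splits on the space but should keep the underscore-joined phrase as a single token.

```r
library(tibble)
library(tidytext)

# Hypothetical two-row example: the same phrase, once with a space
# and once joined by an underscore.
phrases <- tibble(id = 1:2, text = c("cat dog", "cat_dog"))

# Default word tokenization: one row per token.
tokens <- unnest_tokens(phrases, word, text)
tokens
```

The row with the space should come back as two separate words, while the underscore version remains one term, which is what lets the n-gram survive as a single row in the tdm.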
To display it as a regular matrix:
tdm %>%
  as.matrix()
#>          Docs
#> Terms     1 2 3
#>   cat_dog 1 1 0
#>   cat_rat 1 0 1
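For phrases of more than two words, it matters which stringr function does the replacement: str_replace() only touches the first whitespace run, while str_replace_all() handles every one. A small sketch, using a three-word phrase that is purely an assumption for illustration (the toy data only contains two-word phrases):

```r
library(stringr)

# Hypothetical three-word phrase, not part of the toy data.
phrase <- "big black cat"

# Replaces only the first whitespace run.
first_only <- str_replace(phrase, "\\s+", "_")

# Replaces every whitespace run.
all_runs <- str_replace_all(phrase, "\\s+", "_")

first_only
all_runs
```

With first_only, a space is still left in the string, so the tokenizer would split the phrase again at that point; all_runs keeps the whole phrase together as one term.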