Is there a way to turn a data frame like this into a term-document matrix? Each keyword consists of two or more words.

Example data

The data is stored in a data frame.

doc_id text
1      c('cat dog', 'cat rat')
2      c('cat dog')
3      c('cat rat')

Desired result

I want to get the result below. The existing TermDocumentMatrix function splits the text into single words, so it does not keep multi-word keywords together.

         Docs
Terms    1 2 3
cat dog  1 1 0
cat rat  1 0 1
JBGruber
pss

1 Answer

Using tidyr and tidytext, you can first unnest the list column and then replace the whitespace inside each keyword with _ (any separator works, but _ is the usual choice for n-grams). That way the words of a keyword are not split apart when the TDM is built:

library(dplyr)
library(tidyr)
library(tidytext)
library(stringr)

# bring toy data into useful form
df <- tibble::tribble(
  ~doc_id, ~text, 
  1,      c('cat dog', 'cat rat'),
  2,      c('cat dog'),
  3,      c('cat rat')
)

tdm <- df %>% 
  unnest(text) %>% 
  mutate(text = str_replace_all(text, "\\s+", "_")) %>% # replace every run of whitespace, so keywords of 3+ words stay joined
  unnest_tokens(word, text) %>%
  count(word, doc_id) %>% 
  cast_tdm(word, doc_id, n)
tdm
#> <<TermDocumentMatrix (terms: 2, documents: 3)>>
#> Non-/sparse entries: 4/2
#> Sparsity           : 33%
#> Maximal term length: 7
#> Weighting          : term frequency (tf)
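
A note on the whitespace step, since the question mentions keywords of two or more words: stringr's str_replace() substitutes only the first match, whereas str_replace_all() substitutes every match. The sketch below uses a made-up three-word keyword (not part of the original data) to show the difference:

```r
library(stringr)

# Hypothetical three-word keyword, for illustration only
keyword <- "big cat dog"

# Only the first run of whitespace is replaced
first_only <- str_replace(keyword, "\\s+", "_")

# Every run of whitespace is replaced
every_match <- str_replace_all(keyword, "\\s+", "_")

print(first_only)
print(every_match)
```

With only the first space replaced, a tokenizer that splits on whitespace would still break the keyword in two.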

To display it as a regular matrix:

tdm %>% 
  as.matrix()
#>          Docs
#> Terms     1 2 3
#>   cat_dog 1 1 0
#>   cat_rat 1 0 1
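
If the keywords should keep their spaces, as in the desired output of the question, one possible variation is to skip tokenization altogether, since each element of the list column is already a complete term. This is a minimal sketch under that assumption; it also assumes the tm package is installed (cast_tdm() needs it), and the name tdm_spaces is just illustrative:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Same toy data as above
df <- tibble::tribble(
  ~doc_id, ~text,
  1,      c('cat dog', 'cat rat'),
  2,      c('cat dog'),
  3,      c('cat rat')
)

tdm_spaces <- df %>%
  unnest(text) %>%              # one row per keyword
  count(text, doc_id) %>%       # keyword frequency per document
  cast_tdm(text, doc_id, n)     # keywords are used as terms unchanged

m <- as.matrix(tdm_spaces)
m
```

Because unnest_tokens() is not called here, the keywords are also not lowercased, which may or may not be what you want.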
JBGruber