The 'dictionary' parameter of TermDocumentMatrix does not work in R

Question

Even though I passed the keywords to 'dictionary' as in the code below, the terms are not extracted from the sentences.

Sample code

library(tm)

data = c('a', 'a b', 'c')
keyword = c('a', 'b')

data = VectorSource(data)
corpus = VCorpus(data)
tdm = TermDocumentMatrix(corpus, control = list(dictionary = keyword))

Result of my code above

inspect(tdm)

<<TermDocumentMatrix (terms: 2, documents: 3)>>
Non-/sparse entries: 0/6
Sparsity           : 100%
Maximal term length: 1
Weighting          : term frequency (tf)
Sample             :
     Docs
Terms 1 2 3
    a 0 0 0
    b 0 0 0

The expected result is:

Terms 1 2 3
    a 1 1 0
    b 0 1 0

score 0 · Accepted Answer · answered Sep 11 '19 at 07:55

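By default, `termFreq` only keeps terms that are at least three characters long (`wordLengths = c(3, Inf)`), so the single-character terms 'a' and 'b' are dropped and their dictionary rows stay at zero. You have to pass a lower minimum word length in the control list: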

tdm = TermDocumentMatrix(corpus, control = list(dictionary = keyword, wordLengths = c(1, Inf)))
as.matrix(tdm)

     Docs
Terms 1 2 3
    a 1 1 0
    b 0 1 0
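
To see that the word-length filter is the cause rather than 'dictionary' itself, here is a minimal sketch with longer terms. The sample words are made up for illustration; the point is that terms of three or more characters pass the default `wordLengths = c(3, Inf)` filter, so no extra control option should be needed.

```r
library(tm)

# Illustrative data (not from the question): every term has >= 3 characters.
data <- c('apple', 'apple berry', 'cherry')
keyword <- c('apple', 'berry')

corpus <- VCorpus(VectorSource(data))

# No wordLengths given, so the default minimum length of 3 applies.
tdm <- TermDocumentMatrix(corpus, control = list(dictionary = keyword))
m <- as.matrix(tdm)
print(m)
```

Here 'apple' and 'berry' are counted as expected, while 'cherry' is left out because it is not in the dictionary.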

answered Sep 11 '19 at 07:55

erocoar


How can I extract keywords that consist of multiple words? For example, with `data = c('dog cat', 'dog', 'cat')` and `keyword = c('dog cat', 'dog', 'cat')`, I want to extract the keyword 'dog cat'. – pss Sep 11 '19 at 08:06
In that case you'll probably need to pass a custom tokenization function to `TermDocumentMatrix`. Ideally, though, you'd use bigram scoring similar to `Gensim`'s `Phraser`, but I don't think it is implemented in R yet – erocoar Sep 11 '19 at 08:16
It seemed simple, but it is more complicated than I thought. Thank you. – pss Sep 11 '19 at 08:54
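
The custom tokenizer suggested in the comments could be sketched roughly as follows. The helper name `unigram_bigram_tokenizer` is hypothetical. It emits every single word plus every pair of adjacent words, so that a two-word dictionary entry such as 'dog cat' can match. This is a naive approach, not the phrase scoring mentioned above, and it assumes the `tokenize` control option together with `words()` and `ngrams()` from the NLP package, which is attached along with tm.

```r
library(tm)

data <- c('dog cat', 'dog', 'cat')
keyword <- c('dog cat', 'dog', 'cat')

# Hypothetical helper: return unigrams plus space-joined adjacent word pairs.
unigram_bigram_tokenizer <- function(x) {
  w <- words(x)
  bigrams <- character(0)
  if (length(w) >= 2) {
    # ngrams() returns a list of word pairs; join each pair with a space.
    bigrams <- vapply(ngrams(w, 2L), paste, character(1), collapse = " ")
  }
  c(w, bigrams)
}

corpus <- VCorpus(VectorSource(data))
tdm <- TermDocumentMatrix(
  corpus,
  control = list(tokenize = unigram_bigram_tokenizer, dictionary = keyword)
)
m <- as.matrix(tdm)
print(m)
```

Note that a document containing 'dog cat' is then counted for 'dog' and 'cat' as well as for 'dog cat'. If only the longest match should count, that would need extra logic on top of this sketch.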
