calculate term document matrix while looking for words within strings also

Question

This question is related to my earlier question, "Treat words separated by space in the same manner".

Posting it as a separate one since it might help other users find it easily.

The question is regarding the way the term document matrix is calculated by tm package currently. I want to tweak this way a little bit as explained below.

Currently, a term document matrix is created by looking for a word, say 'milky', as a separate word (and not as a string) in a document. For example, assume two documents:

 document 1: "this is a milky way galaxy"
 document 2: "this is a milkyway galaxy"

With the way the current algorithm (tm package) works, 'milky' is found in the first document but not in the second, since the algorithm looks for the term 'milky' as a separate word. But if the algorithm had looked for the term 'milky' as a string, like the function grepl does, it would have found 'milky' in the second document as well.

grepl('milky', 'this is a milkyway galaxy')
TRUE

Can someone please help me create a term document matrix that meets this requirement, namely finding the term 'milky' in both documents? Please note that I don't want a solution specific to one word such as 'milky'; I want a general solution that I can apply on a larger scale to handle all such cases. The solution does not have to use the tm package, as long as I end up with a term document matrix that meets the requirement. Ultimately, each term in the matrix should be looked for as a string (not just as a word) inside all the strings of the document in question, i.e. grepl-like functionality while calculating the term document matrix.

Current code which I use to get term document matrix is

doc1 <-  "this is a document about milkyway"
doc2 <-  "milky way is huge"

library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

@pcantalupo where do I use this `/b`? The problem is not specific to just 'milky' as I have explained. 'milky' is just an example. Ultimately I want to be able to create a term document matrix which gets calculated in a way such that each term should be looked for within strings of the document also. — user3664020, Oct 13 '15 at 13:34
`grepl('\\bmilky\\b', 'this is a milkyway galaxy')` will return FALSE — pcantalupo, Oct 13 '15 at 13:46
@pcantalupo i don't want to be rude. But you don't seem to get the question. Please read it again. — user3664020, Oct 13 '15 at 13:51
This really isn't straightforward to solve. What if you have the word "justice"? Should that also be a hit for "ice"? Or should "discovery" match "disco"? An algorithm would need to have an understanding of what words are related. That is unless you want to compare all possible substrings, but then the number of "terms" will explode quickly. You may wish to maintain your own list of words that you want to split. — MrFlick, Oct 13 '15 at 14:19

score 0 · Answer 1 · answered Oct 15 '15 at 20:03

I am not sure that tm makes it easy (or possible) to select or group features based on regular expressions. But the text package quanteda does, through a thesaurus argument that groups terms according to a dictionary, when constructing its document-feature matrix.

(quanteda uses the generic term "feature" because, as here, a column can be a category, such as terms containing the phrase milky, rather than an original "term".)

The valuetype argument can be the "glob" format (default), a regular expression ("regex"), or as-is fixed ("fixed"). Below I show the versions with glob and regular expressions.
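To make the relation between the pattern styles concrete, here is a small base R sketch (it does not call quanteda, and it illustrates pattern matching in general rather than quanteda's internals). The `tokens` vector is made up from the example documents. `glob2rx()` from the utils package converts a glob into the equivalent anchored regular expression, which is why the glob "milky*" and the regex "^milky" select the same tokens.

```r
# Tokens taken from the example documents (illustrative only).
tokens <- c("milky", "milkyway", "way", "galaxy")

# glob2rx() translates a glob into an anchored regular expression.
rx <- glob2rx("milky*")
rx

# Glob-style and regex-style matching pick the same tokens:
# those that start with "milky".
glob_hits  <- grepl(rx, tokens)
regex_hits <- grepl("^milky", tokens)

# An exact comparison only matches the whole token "milky".
exact_hits <- tokens == "milky"

data.frame(tokens, glob_hits, regex_hits, exact_hits)
```

Note that both "milky*" and "^milky" only catch tokens that begin with "milky"; a pattern such as the glob "*milky*" or the unanchored regex "milky" would be needed to catch the term anywhere inside a token.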

library(quanteda)
myDictGlob <- dictionary(list(containsMilky = c("milky*")))
myDictRegex <- dictionary(list(containsMilky = c("^milky")))

(plainDfm <- dfm(c(doc1, doc2)))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 2 documents
## ... indexing features: 9 feature types
## ... created a 2 x 9 sparse dfm
## ... complete. 
## Elapsed time: 0.008 seconds.
## Document-feature matrix of: 2 documents, 9 features.
## 2 x 9 sparse Matrix of class "dfmSparse"
## features
## docs    this is a document about milkyway milky way huge
## text1    1  1 1        1     1        1     0   0    0
## text2    0  1 0        0     0        0     1   1    1

dfm(c(doc1, doc2), thesaurus = myDictGlob, valuetype = "glob", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##       this is a document about way huge CONTAINSMILKY
## text1    1  1 1        1     1   0    0             1
## text2    0  1 0        0     0   1    1             1

dfm(c(doc1, doc2), thesaurus = myDictRegex, valuetype = "regex", verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##       this is a document about way huge CONTAINSMILKY
## text1    1  1 1        1     1   0    0             1
## text2    0  1 0        0     0   1    1             1
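For the fully general case asked about in the question (every term looked up as a substring, not only "milky"), a minimal base R sketch could look like the following. It uses neither tm nor quanteda; `substring_tdm` and its `min_chars` parameter are hypothetical names introduced here for illustration, and the cleaning steps are only an approximation of the tm control options used in the question (stopwords are not removed).

```r
# Hypothetical helper (not part of tm or quanteda): for every term in the
# vocabulary, count how many tokens of each document contain it as a substring.
substring_tdm <- function(docs, min_chars = 2) {
  # Lowercase, replace punctuation and digits with spaces, split on whitespace.
  clean  <- gsub("[[:punct:][:digit:]]+", " ", tolower(docs))
  tokens <- strsplit(trimws(clean), "[[:space:]]+")

  # Vocabulary: every distinct token with at least `min_chars` characters.
  vocab <- sort(unique(unlist(tokens)))
  vocab <- vocab[nchar(vocab) >= min_chars]

  # One column per document: number of tokens containing each term.
  counts <- vapply(
    tokens,
    function(tok) {
      vapply(vocab, function(term) sum(grepl(term, tok, fixed = TRUE)), integer(1))
    },
    integer(length(vocab))
  )
  matrix(counts, nrow = length(vocab),
         dimnames = list(Terms = vocab, Docs = as.character(seq_along(docs))))
}

doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
tdm <- substring_tdm(c(doc1, doc2))
tdm[c("milky", "milkyway", "way"), ]
```

With these two documents, "milky" is counted once in each document, because it is found inside "milkyway" as well. The same mechanism also counts "way" in the first document and "is" inside "this", which is the kind of unwanted hit MrFlick's comment warns about, so on a larger corpus a curated term list or a minimum term length may be needed.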
