I am trying to find the words occurring in multiple documents at the same time.
Let us take an example.
doc1: "this is a document about milkyway"
doc2: "milky way is huge"
As you can see, the word "milkyway" occurs in both documents, but in the second document it is split by a space ("milky way") and in the first it is not.
I am doing the following to get the term-document matrix in R:
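library(tm)
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
# Lowercase, drop numbers, punctuation and stopwords; keep terms of 2+ characters
tmpDTM <- TermDocumentMatrix(tmp.corpus, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df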
         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1
As per the above matrix, the term "milkyway" is present only in the first doc.
I want to get 1 in both docs for the term "milkyway" in the above matrix. This is just an example; I need to do this for a lot of documents. Ultimately I want to be able to treat such variants ("milkyway" and "milky way") as the same term.
EDIT 1:
Can't I force the term-document matrix to be calculated so that each term is looked for not only as a separate word but also within other strings? For example, one term is "milky" and one document is "this is milkyway". Currently "milky" does not occur in this document, but if the algorithm also looked for terms within strings, it would find "milky" inside "milkyway". That way the terms "milky" and "way" would be counted in both of my documents from the earlier example.
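To make the idea concrete, here is a minimal base-R sketch of the substring matching I have in mind. It is not how tm builds its matrix, and the names docs, terms and substr_tdm are made up for illustration. It only marks presence (0/1), and I am assuming plain fixed-string matching with grepl is good enough.

```r
# Hypothetical example data; the term list is copied from the matrix above.
docs <- c(doc1 = "this is a document about milkyway",
          doc2 = "milky way is huge")
terms <- c("document", "huge", "milky", "milkyway", "way")

# For each document, record 1 if the term occurs anywhere in the text,
# even inside a longer word, and 0 otherwise.
substr_tdm <- sapply(docs, function(d) {
  as.integer(sapply(terms, function(t) grepl(t, tolower(d), fixed = TRUE)))
})
rownames(substr_tdm) <- terms
substr_tdm
```

With this, "milky" and "way" get 1 in both documents, but "milkyway" still gets 0 for doc2, and short terms would presumably also match inside unrelated words (for example "way" inside "always").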
EDIT 2:
Ultimately I want to be able to calculate the cosine similarity between documents.
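For reference, this is roughly how I would compute it from a term-document matrix in base R. It is a sketch only: cosine_sim is a hypothetical helper, not a tm function, and the counts are typed in by hand from the matrix above.

```r
# Hypothetical helper: cosine similarity between the columns (documents)
# of a term-document matrix.
cosine_sim <- function(m) {
  m <- as.matrix(m)
  dots <- crossprod(m)        # t(m) %*% m, dot product of every document pair
  norms <- sqrt(diag(dots))   # Euclidean length of each document vector
  dots / outer(norms, norms)
}

# Counts copied from the matrix above (terms in rows, documents in columns).
m <- matrix(c(1, 0, 0, 1, 0,
              0, 1, 1, 0, 1),
            ncol = 2,
            dimnames = list(c("document", "huge", "milky", "milkyway", "way"),
                            c("1", "2")))
cosine_sim(m)
```

With the current matrix the two documents share no terms, so their similarity comes out as 0, which is exactly the problem: "milkyway" and "milky way" contribute nothing to the similarity.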