
I am trying to find words that occur in multiple documents at the same time.

Let us take an example.

doc1: "this is a document about milkyway"
doc2: "milky way is huge"

As you can see in the above 2 documents, the word "milkyway" occurs in both docs, but in the second document the term is separated by a space and in the first doc it is not.

I am doing the following to get the term-document matrix in R:

library(tm)

# the two example documents from above
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"

tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

The term milkyway is present only in the first doc, as per the above matrix.

I want to get a 1 in both docs for the term "milkyway" in the above matrix. This is just an example; I need to do this for a lot of documents. Ultimately I want to be able to treat such words ("milkyway" & "milky way") in the same manner.

EDIT 1:

Can't I force the term-document matrix to be calculated in such a way that, for whatever word it is looking for, it doesn't just look for that word as a separate word in the string but also within longer strings? For example, one term is milky and there is a document this is milkyway; currently milky does not occur in this document, but if the algorithm also looked for the word within strings it would find milky inside milkyway. That way the words milky and way would get counted in both my documents (earlier example).

EDIT 2:

Ultimately I want to be able to calculate similarity cosine index between documents.
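
For reference, the cosine index I have in mind would be computed from the term-document matrix roughly like this (a minimal sketch in base R, using the tmp.df from above):

m <- as.matrix(tmp.df)
cosine_sim <- sum(m[, 1] * m[, 2]) /
    (sqrt(sum(m[, 1]^2)) * sqrt(sum(m[, 2]^2)))
cosine_sim
# gives 0 for the example above, because "milkyway" and "milky"/"way"
# are counted as different terms, which is exactly the problem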

user3664020
  • Maybe remove spaces then use regex? – zx8754 Oct 13 '15 at 10:01
  • Do you only need to do this for 'milky way' or others? Do you prefer that they both be 'milkyway'? – sebastian-c Oct 13 '15 at 10:06
  • @sebastian-c I need to do this for multiple words. I prefer both to become "milkyway" in some way. There could be cases like "everyday" and "every day". In this case I would prefer them to be "everyday". – user3664020 Oct 13 '15 at 10:14
  • How would you know which words should be without spaces between them? I don't see any pattern here. – David Arenburg Oct 13 '15 at 10:31
  • Just off the top of my head, maybe `adist` could be of some use; having a space or separator in a word means the Levenshtein distance between them would be 1. This adds another complexity for similar words, that said... – Tensibai Oct 13 '15 at 13:04

4 Answers


You will first need to convert the documents to a bag-of-primitive-words representation, where each primitive word is matched with a set of surface forms. The primitive word can also appear in the corpus.

For instance:

milkyway -> {milky, milky way, milkyway} 
economy -> {economics, economy}
sport -> {soccer, football, basket ball, basket, NFL, NBA}

You can build such a dictionary before computing the cosine distance, using both a synonym dictionary and an edit distance such as Levenshtein to complete the synonym dictionary.

Computing the 'sport' key is more involved.
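
A minimal sketch in R of applying such a dictionary before building the term-document matrix (the mapping list and the gsub-based normalisation below are only illustrative assumptions, not a fixed recipe):

# map each primitive word to its set of surface forms
primitive_map <- list(
    milkyway = c("milky way", "milkyway"),
    everyday = c("every day", "everyday")
)

# replace every surface form by its primitive word before tokenising
normalise <- function(texts, map) {
    for (primitive in names(map)) {
        for (variant in map[[primitive]]) {
            texts <- gsub(variant, primitive, texts, fixed = TRUE)
        }
    }
    texts
}

normalise(c("this is a document about milkyway", "milky way is huge"), primitive_map)
# [1] "this is a document about milkyway" "milkyway is huge"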

amirouche

You can use a regex to match every possible split of the words by inserting "\\s?" between every character in your search words. If you only want specific splits, you just insert it at those places. The following code generates a regex pattern for each search term by inserting "\\s?" between every character. grep returns the indices of the documents in which the pattern matches, but it can be exchanged for other regex functions.

docs <- c("this is a document about milkyway",  "milky way is huge")
search_terms <- c("milkyway", "document")
pattern_fix <- sapply(strsplit(search_terms, split = NULL), paste0, collapse = "\\s?")
sapply(pattern_fix, grep, docs)

$`m\\s?i\\s?l\\s?k\\s?y\\s?w\\s?a\\s?y`
[1] 1 2

$`d\\s?o\\s?c\\s?u\\s?m\\s?e\\s?n\\s?t`
[1] 1

Edit:

To search for all words, you could just use the row names of tmp.df from your script as the search_terms in my solution.

doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"

library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

search_terms <- row.names(tmp.df)
pattern_fix <- sapply(strsplit(search_terms, split = NULL), paste0, collapse = "\\s?")
names(pattern_fix) <- search_terms
word_count <- sapply(pattern_fix, grep, tmp.text[[1]])
h_table <- sapply(word_count, function(x) table(factor(x, levels = 1:nrow(tmp.text)))) #horizontal table
v_table <- t(h_table) #vertical table (like tmp.df)
v_table

         1 2
document 1 0
huge     0 1
milky    1 1
milkyway 1 1
way      1 1
JohannesNE
  • Thanks for making an effort, but your solution requires me to explicitly mention the terms I want to match, which I don't know in advance. See my EDIT 1 and EDIT 2 if that helps you come up with a better solution. – user3664020 Oct 13 '15 at 12:39
  • See my edit. There may be a better way, but this works for this short example at least. – JohannesNE Oct 14 '15 at 07:33

Here's a solution that requires no preset list of words. It performs the separation by tokenising the texts into bigrams and checking which bigrams, once the separator between the adjacent words is removed, also occur as unigram tokens. Those matches are saved and later replaced in the texts with the separated versions.

This means that no pre-set lists are required, but only those concatenated terms are split apart that also occur in a separated version somewhere in the text. Note that this might generate false positives such as "berated" and "be rated", which might not be occurrences of the same pair, but rather a valid unigram (the first term) distinct from the equivalent concatenated bigram (the second term). (No perfect solution to this particular problem exists.)

This solution requires the quanteda package for text analysis, and the stringi package for vectorised regex replacement.

# original example
myTexts <- c(doc1 = "this is a document about milkyway", doc2 = "milky way is huge")

require(quanteda) 

unparseMatches <- function(texts) {
    # tokenize all texts
    toks <- quanteda::tokenize(toLower(texts), simplify = TRUE)
    # tokenize bigrams
    toks2 <- quanteda::ngrams(toks, 2, concatenator = " ")
    # find out which compressed pairs exist already compressed in original tokens
    compoundTokens <- toks2[which(gsub(" ", "", toks2) %in% toks)]
    # vectorized replacement and return
    result <- stringi::stri_replace_all_fixed(texts, gsub(" ", "", compoundTokens), compoundTokens, vectorize_all = FALSE)
    # because stringi strips names
    names(result) <- names(texts)
    result
}

unparseMatches(myTexts)
##                                 doc1                                 doc2 
##  "this is a document about milky way"                 "milky way is huge" 
quanteda::dfm(unparseMatches(myTexts), verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a document about milky way huge
##   doc1    1  1 1        1     1     1   1    0
##   doc2    0  1 0        0     0     1   1    1


# another test, with two sets of phrases that need to be unparsed 
testText2 <- c(doc3 = "This is a super duper data set about the milky way.",
               doc4 = "And here is another superduper dataset about the milkyway.")
unparseMatches(testText2)
##                                                            doc3                                                            doc4 
##           "This is a super duper data set about the milky way." "And here is another super duper data set about the milky way." 
(myDfm <- dfm(unparseMatches(testText2), verbose = FALSE))
## Document-feature matrix of: 2 documents, 14 features.
## 2 x 14 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a super duper data set about the milky way and here another
##   doc3    1  1 1     1     1    1   1     1   1     1   1   0    0       0
##   doc4    0  1 0     1     1    1   1     1   1     1   1   1    1       1

quanteda can also do similarity computations such as cosine distance:

quanteda::similarity(myDfm, "doc3", margin = "documents", method = "cosine")
##      doc4   <NA> 
##    0.7833     NA 

I'm not sure what the NA is -- it appears to be a bug in the output when there is just one document to compare to a two-document set. (I'll fix this soon, but the result is still correct.)
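
As a quick cross-check, the cosine value can also be computed by hand from the dense version of the dfm (a minimal sketch):

m <- as.matrix(myDfm)
sum(m["doc3", ] * m["doc4", ]) /
    (sqrt(sum(m["doc3", ]^2)) * sqrt(sum(m["doc4", ]^2)))
# matches the 0.7833 reported by similarity() above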

Ken Benoit

As Ken already stated:

(No perfect solution to this particular problem exists.)

For all I know this is absolutely right and backed by many textbooks and journals on text mining, usually within the first few paragraphs.

In my research I rely on already prepared datasets like the „Deutscher Wortschatz" project. There they have already done the hard work and provide high-quality lists of synonyms, antonyms, polysemic terms etc. Among other things, this project provides access via a SOAP interface. A database for the English language is WordNet, for example.

If you do not want to use a precalculated set or cannot afford it, I suggest you go with amirouche's approach and primitive-word representations. Building them word by word is tedious and labour-intensive, yet it is the most viable approach.

Every other method that comes to my mind is definitely far more complex. Just see the other answers, or the state-of-the-art approach from „Text Mining, Wissensrohstoff Text" by G. Heyer, U. Quasthoff and T. Wittig: clustering on word forms via (1) identification of characteristic features (index terms), (2) creation of a term-sentence matrix and choice of a weighting for calculating a term-term matrix, (3) choice of a similarity measure to run on the term-term matrix, and finally (4) choice and application of a clustering algorithm. A rough sketch of steps (2) to (4) follows below.
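
For illustration only, here is a rough sketch of steps (2) to (4) in R, using documents in place of sentences and a plain count weighting; all of these choices are simplifying assumptions, not the book's exact method:

library(tm)

docs <- c("this is a document about milkyway",
          "milky way is huge",
          "another document about the milky way")
tdm <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(docs))))

# (2) term-term matrix from the term-document matrix (simple count weighting)
ttm <- tdm %*% t(tdm)

# (3) cosine similarity between terms
norms <- sqrt(rowSums(tdm^2))
cos_tt <- ttm / (norms %o% norms)

# (4) hierarchical clustering on the corresponding distance
hc <- hclust(as.dist(1 - cos_tt))
plot(hc)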

I would suggest you mark amirouche's post as the correct answer because this is so far the best and most practicable way of doing things (that I know of).

Dennis Proksch