"Difference" among Document Term Matrices

Question

Suppose I have a set of 100 documents, 70 about politics and 30 about math (a weird combination, I know). My goal is to represent them in the xy plane through methods such as multidimensional scaling, network analysis, self-organizing maps (SOM), etc. When I consider the whole set of documents, I proceed like this:

I build a corpus (docs) with 100 elements;
from the corpus I create a document term matrix (dtm);
from the dtm I compute a distance matrix (dist), either between the terms that make up the documents or between the documents themselves (depending on what I want to represent).
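
The three steps above can be sketched as follows. This is only a minimal illustration in base R: the four toy documents and all object names are made up, and the hand-built count matrix stands in for what `as.matrix(DocumentTermMatrix(corpus))` would give with the tm package.

```r
# Toy stand-in for a real corpus: four short "documents" (hypothetical data).
docs <- c(
  p1 = "the parliament voted on the new tax law",
  p2 = "the senate debated the tax reform",
  m1 = "the theorem follows from the lemma",
  m2 = "the proof of the lemma uses induction"
)

# Tokenize and count terms per document (base R only).
# With tm, as.matrix(DocumentTermMatrix(corpus)) has the same
# documents-by-terms layout.
tokens <- strsplit(tolower(docs), "[^a-z]+")
terms  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
))
dimnames(dtm) <- list(names(docs), terms)

# Euclidean distances between documents (rows) or between terms (columns).
dist_docs  <- dist(dtm)
dist_terms <- dist(t(dtm))

# Classical multidimensional scaling: one (x, y) point per document.
xy <- cmdscale(dist_docs, k = 2)
```

Passing `dist_terms` to `cmdscale()` instead would place the terms, rather than the documents, in the plane.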

Obviously I can produce separate plots for the two groups, but I'd like to do something different. I have three corpora (docs_tot, docs_P, docs_M) and three document term matrices (dtm_tot, dtm_P, dtm_M).

Possible approaches:

1) Plot all the documents in the xy plane, coloring the politics documents and the math documents differently. This way I can see whether they form natural clusters.

2) Run a network analysis on the differences. Is there a conceptually sound way to take the difference between, for example, dtm_P and dtm_tot, given that dtm_P contains only a subset (70) of the 100 documents in dtm_tot?
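
A rough sketch of both ideas, using a small made-up count matrix in place of `as.matrix(dtm_tot)`. The `group` labels, the colors, and the choice to compare relative term frequencies are assumptions made for illustration, not an established recipe. The key point for the second idea is that two matrices built separately have to be aligned on the same terms before anything is subtracted.

```r
# Hypothetical documents-by-terms counts standing in for as.matrix(dtm_tot).
dtm_tot <- rbind(
  p1 = c(tax = 2, vote = 1, proof = 0, lemma = 0),
  p2 = c(tax = 1, vote = 2, proof = 0, lemma = 0),
  p3 = c(tax = 1, vote = 1, proof = 1, lemma = 0),
  m1 = c(tax = 0, vote = 0, proof = 2, lemma = 1),
  m2 = c(tax = 0, vote = 0, proof = 1, lemma = 2)
)
group <- factor(c("P", "P", "P", "M", "M"))  # one label per document

# Approach 1: scale all documents together, then color points by group.
xy   <- cmdscale(dist(dtm_tot), k = 2)
cols <- c(M = "blue", P = "red")[as.character(group)]
if (interactive()) {
  plot(xy, col = cols, pch = 19, xlab = "Dimension 1", ylab = "Dimension 2")
  legend("topright", legend = c("math", "politics"),
         col = c("blue", "red"), pch = 19)
}

# Approach 2: a subset matrix built on its own usually lacks some terms.
# Simulate that by dropping the terms that never occur in the P documents.
dtm_P <- dtm_tot[group == "P", , drop = FALSE]
dtm_P <- dtm_P[, colSums(dtm_P) > 0, drop = FALSE]

# Align dtm_P to the full vocabulary, filling the missing terms with zeros.
aligned_P <- matrix(0, nrow(dtm_P), ncol(dtm_tot),
                    dimnames = list(rownames(dtm_P), colnames(dtm_tot)))
aligned_P[, colnames(dtm_P)] <- dtm_P

# The documents of dtm_P are a subset of dtm_tot, so subtracting the term
# totals leaves exactly the counts contributed by the remaining documents.
freq_P <- colSums(aligned_P)
freq_M <- colSums(dtm_tot) - freq_P

# Difference in relative term frequency: positive values lean towards
# politics, negative values towards math.
rel_diff <- freq_P / sum(freq_P) - freq_M / sum(freq_M)
```

The `rel_diff` vector could then serve as a term weight or filter before building a network, for instance by keeping only the terms with a large absolute difference.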

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

It sounds like comparison.cloud() might be worth considering. Here's an example from the help page of the wordcloud package:

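library(tm)
library(wordcloud)

# Two State of the Union speeches shipped with the wordcloud package
data(SOTU)
corp <- SOTU

# Basic cleaning: lower-case, then drop numbers, stop words and punctuation
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removePunctuation)

# Terms in rows, one column per document
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- c("SOTU 2010", "SOTU 2011")

# Show the words that most distinguish each document from the other
comparison.cloud(term.matrix, max.words = 40, random.order = FALSE)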

This also works for more than two groups, as shown, e.g., here.

Hope this helps.

It isn't exactly what I was looking for but... nevertheless, it's an interesting solution! — Andrea Ianni, Apr 01 '16 at 12:35
