Suppose I have a set of 100 documents, 70 about politics and 30 about math (a weird combination, I know). My goal is to represent them on an xy plane through methods like multidimensional scaling, network analysis, SOM, etc. When I consider the whole set of documents, I proceed like this:

  • I produce a corpus (docs) with 100 elements;
  • from the corpus I create a document term matrix (dtm);
  • from the dtm I create a distance matrix (dist), either between the terms that make up the documents or between the documents themselves (depending on what I want to represent).
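To make the three steps concrete, here is a minimal sketch in base R. The four toy texts and the whitespace tokenizer are assumptions for illustration only; in practice a text-mining package would build `docs` and `dtm`.

```r
# Hypothetical toy corpus standing in for the 100 documents.
texts <- c(p1 = "the senate votes on the budget",
           p2 = "the senate debates the election",
           m1 = "the proof uses a lemma on primes",
           m2 = "a lemma on primes and a theorem")

# Step 1-2: tokenize on whitespace and count each term per document.
tokens <- strsplit(tolower(texts), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
rownames(dtm) <- names(texts)   # one row per document
colnames(dtm) <- vocab          # one column per term

# Step 3: Euclidean distances between documents (rows) ...
dist_docs  <- dist(dtm)
# ... or between terms (columns), by transposing first.
dist_terms <- dist(t(dtm))
```

Transposing the matrix is the only difference between the two views: `dist()` always compares rows.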

Obviously I can produce separate plots for the two groups, but I'd like to do something different. I have three corpora (docs_tot, docs_P, docs_M) and three document-term matrices (dtm_tot, dtm_P, dtm_M).

Possible approaches:

1) Plotting all the documents on the xy plane, coloring the politics documents and the math ones differently. This way I can see whether they form natural clusters.
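A rough sketch of what I mean by option 1, using classical multidimensional scaling from base R (`cmdscale`). The six toy texts, the `group` labels and the two colors are made up for illustration.

```r
# Hypothetical toy documents: three politics (P), three math (M).
texts <- c(p1 = "senate votes on the budget bill",
           p2 = "the election and the senate vote",
           p3 = "budget debate before the election",
           m1 = "proof of the lemma on primes",
           m2 = "a theorem about primes and a lemma",
           m3 = "the proof of the theorem uses algebra")
group <- factor(c("P", "P", "P", "M", "M", "M"))

# Build a small document-term matrix by whitespace tokenization.
tokens <- strsplit(tolower(texts), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
rownames(dtm) <- names(texts)
colnames(dtm) <- vocab

# Project the document distances onto two dimensions.
xy <- cmdscale(dist(dtm), k = 2)

# One color per document, chosen by its group label.
group_cols <- c(P = "red", M = "blue")
cols <- group_cols[as.character(group)]

plot(xy, col = cols, pch = 19, xlab = "Dimension 1", ylab = "Dimension 2")
text(xy, labels = rownames(dtm), pos = 3)
legend("topright", legend = levels(group),
       col = group_cols[levels(group)], pch = 19)
```

The grouping only affects the colors, not the layout, so any visible separation comes from the distances themselves.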

2) Running a network analysis on the differences. Is there a conceptually sound way to take the difference between, for example, dtm_P and dtm_tot, given that dtm_P contains only a subset (70) of the documents in dtm_tot (100)?
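Since the two matrices have different numbers of rows, a cell-by-cell subtraction is not defined. One possible reading, and it is only an assumption on my part, is to compare the average term frequencies of the subset with those of the whole set over a shared vocabulary. A sketch with made-up counts:

```r
# Hypothetical document-term matrix for the whole set (4 documents).
dtm_tot <- rbind(p1 = c(2, 0, 1),
                 p2 = c(1, 0, 1),
                 m1 = c(0, 2, 1),
                 m2 = c(0, 1, 1))
colnames(dtm_tot) <- c("vote", "lemma", "the")

# Hypothetical matrix built from the politics corpus alone:
# fewer rows, and the term "lemma" never appears.
dtm_P <- rbind(p1 = c(2, 1),
               p2 = c(1, 1))
colnames(dtm_P) <- c("vote", "the")

# Align dtm_P to the full vocabulary, filling missing terms with 0.
all_terms <- colnames(dtm_tot)
aligned_P <- matrix(0, nrow = nrow(dtm_P), ncol = length(all_terms),
                    dimnames = list(rownames(dtm_P), all_terms))
aligned_P[, colnames(dtm_P)] <- dtm_P

# Difference in mean frequency per term: positive values are terms
# that are more frequent in the politics subset than overall.
term_diff <- colMeans(aligned_P) - colMeans(dtm_tot)
```

With these counts, `vote` comes out positive, `lemma` negative and `the` at zero, which is the kind of contrast a network on the differences could then be built from.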

Andrea Ianni

1 Answer

It sounds like a comparison cloud, drawn with comparison.cloud(), might suit this. Here's an example from the help page of the wordcloud package:

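library(tm)
library(wordcloud)

# Two State of the Union speeches shipped with wordcloud
data(SOTU)
corp <- SOTU

# Normalize the text before counting terms
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
corp <- tm_map(corp, removePunctuation)

# Term-document matrix: one row per term, one column per speech
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- c("SOTU 2010","SOTU 2011")

# Draw the terms that differ most between the two columns
comparison.cloud(term.matrix,max.words=40,random.order=FALSE)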

This also works for more than two groups, as shown, e.g., here.
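For the politics/math case in the question, comparison.cloud() needs a matrix with one row per term and one column per group. A minimal sketch of how that matrix could be assembled from two document-term matrices follows; the counts, the `group_totals` helper and the column labels are hypothetical, and the plotting call only runs if wordcloud is installed.

```r
# Hypothetical document-term matrices for the two groups.
dtm_P <- rbind(p1 = c(2, 1, 1),
               p2 = c(1, 2, 1))
colnames(dtm_P) <- c("vote", "senate", "the")

dtm_M <- rbind(m1 = c(2, 1, 1),
               m2 = c(1, 1, 1))
colnames(dtm_M) <- c("lemma", "proof", "the")

# Shared vocabulary across both groups.
terms <- union(colnames(dtm_P), colnames(dtm_M))

# Total count of every term in one group, 0 for terms it never uses.
group_totals <- function(dtm) {
  out <- setNames(numeric(length(terms)), terms)
  out[colnames(dtm)] <- colSums(dtm)
  out
}

# Terms in rows, groups in columns: the layout comparison.cloud() expects.
term.matrix <- cbind(Politics = group_totals(dtm_P),
                     Math     = group_totals(dtm_M))

if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(term.matrix, max.words = 40,
                              random.order = FALSE)
}
```

Adding a third group would just mean binding another column onto `term.matrix`.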

Hope this helps.

RHertel