Suppose I have a set of 100 documents, 70 about politics and 30 about math (a weird combination, I know). My goal is to represent them on the xy plane through methods like multidimensional scaling (MDS), network analysis, self-organizing maps (SOM), etc. When I consider the whole set of documents, I proceed like this:
- I produce a corpus (docs) with 100 elements;
- from the corpus I create a document term matrix (dtm);
- from the dtm I create a distance matrix (dist), either between the terms that make up the documents or between the documents themselves (depending on what I want to represent).
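To make these steps concrete, here is a minimal sketch in Python using only NumPy. It is not my actual code: the four toy documents and the helper `euclidean_distances` are made up for illustration, and Euclidean distance is just one possible choice of metric.

```python
from collections import Counter
import numpy as np

# Toy corpus: hypothetical stand-ins for the real documents.
docs = [
    "the parliament voted on the new tax law",
    "the senate debated the tax reform",
    "the theorem follows from the lemma",
    "the proof of the lemma uses induction",
]

# Document term matrix: one row per document, one column per term.
tokens = [d.split() for d in docs]
vocab = sorted({t for doc in tokens for t in doc})
col = {t: j for j, t in enumerate(vocab)}
dtm = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(tokens):
    for term, count in Counter(doc).items():
        dtm[i, col[term]] = count

def euclidean_distances(m):
    """Pairwise Euclidean distances between the rows of m."""
    diff = m[:, None, :] - m[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

dist_docs = euclidean_distances(dtm)     # distances between documents
dist_terms = euclidean_distances(dtm.T)  # distances between terms
```

Transposing the dtm is what switches between the two views: rows of `dtm` are documents, rows of `dtm.T` are terms.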
Obviously I can produce separate plots for the two groups, but I'd like to do something different. I have three corpora (docs_tot, docs_P, docs_M) and three document term matrices (dtm_tot, dtm_P, dtm_M).
Possible approaches:
1) Represent all the documents on the xy plane, coloring the politics documents and the math documents differently. This way I can see whether they form natural clusters.
2) Run a network analysis on the differences. Is there a conceptually sound way to subtract, for example, dtm_P from dtm_tot, given that dtm_P contains only a subset (70) of the documents in dtm_tot (100)?
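To show what I mean by both ideas, here is a rough sketch in Python with NumPy on toy data. Everything in it is an assumption for illustration: the counts, the `labels` vector, and the helper `classical_mds` (classical Torgerson scaling, one of several MDS variants). For idea 2, it takes one possible reading of "subtracting": since the two matrices have different numbers of rows, it compares per-term totals instead of the matrices themselves. It also assumes dtm_P is a row subset of dtm_tot, so the columns already match; if dtm_P is built separately from docs_P, its columns would first have to be re-indexed on the vocabulary of dtm_tot.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are terms.
vocab = ["law", "tax", "vote", "lemma", "proof"]
dtm_tot = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 0, 1, 0],
    [3, 3, 1, 0, 0],
    [0, 0, 0, 3, 2],
    [0, 1, 0, 2, 3],
], dtype=float)
labels = np.array(["P", "P", "P", "M", "M"])  # P = politics, M = math

def classical_mds(dist, k=2):
    """Classical (Torgerson) MDS: embed a distance matrix in k dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))

# Idea 1: one map for all documents, one color per group.
diff = dtm_tot[:, None, :] - dtm_tot[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
coords = classical_mds(dist)  # one (x, y) pair per document
colors = ["red" if lab == "P" else "blue" for lab in labels]

# Idea 2: compare per-term totals, which have the same length for both matrices.
dtm_P = dtm_tot[labels == "P"]
tot = dtm_tot.sum(axis=0)
pol = dtm_P.sum(axis=0)
rest = tot - pol      # counts contributed by the non-politics documents
share_P = pol / tot   # fraction of each term's occurrences found in politics documents
```

The `coords` and `colors` can be passed to any scatter-plot function, for example matplotlib's `plt.scatter(coords[:, 0], coords[:, 1], c=colors)`. With only two groups, `rest` is simply the term totals of the math documents, so the "difference" carries no more information than dtm_M itself; `share_P` might be a more useful quantity to feed into a network analysis.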