I am clustering documents using the cosine similarity between each pair of documents, and that part works fine. My problem is a little unusual, though: I only want to cluster certain documents with certain others, not every document against every other. Here's an example.

I have two spreadsheets (documents) with three labels apiece. I want to cluster labels that are similar to each other BETWEEN the documents, but never two labels from within the same document. For instance:

  • Doc1 has labels: sex and gender, tobacco use years, current age

  • Doc2 has labels: gender, age now, time of use

To express this, I've created a similarity matrix that looks like this (d1_l2 is the second label of Doc1, and so on):

              d1_l1     d1_l2     d1_l3     d2_l1     d2_l2     d2_l3
    d1_l1 1.0000000        NA        NA 0.5773503 0.0000000 0.0000000
    d1_l2        NA 1.0000000        NA 0.0000000 0.0000000 0.3333333
    d1_l3        NA        NA 1.0000000 0.0000000 0.5000000 0.0000000
    d2_l1 0.5773503 0.0000000 0.0000000 1.0000000        NA        NA
    d2_l2 0.0000000 0.0000000 0.5000000        NA 1.0000000        NA
    d2_l3 0.0000000 0.3333333 0.0000000        NA        NA 1.0000000

Here the cosine similarity between labels in the same document is set to NA. The problem is that agnes and other hierarchical clustering methods don't accept NA values. What should I do? Am I thinking about this the wrong way?
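
To make the setup reproducible, here is a minimal sketch in Python (NumPy and SciPy, used here in place of R's agnes). It assumes the similarities are cosine values over binary bag-of-words vectors, which appears consistent with the numbers in the matrix, and masks same-document pairs as NaN. The second half shows one workaround that might be worth trying, not a definitive answer: convert similarity to distance, fill the masked pairs with the maximum distance of 1.0 (an assumption), and use complete linkage. Because a complete-linkage cluster's height is its largest pairwise distance, cutting the tree below 1.0 should keep two labels from the same document out of the same cluster. The helper `cosine` and the variable names are hypothetical.

```python
import math

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cosine(a, b):
    """Cosine similarity of two labels as binary bag-of-words vectors."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


doc1 = ["sex and gender", "tobacco use years", "current age"]
doc2 = ["gender", "age now", "time of use"]

labels = doc1 + doc2
doc_of = [1] * len(doc1) + [2] * len(doc2)  # which document each label is from
n = len(labels)

# Similarity matrix: NaN for two different labels from the same document.
sim = np.full((n, n), np.nan)
for i in range(n):
    for j in range(n):
        if i == j or doc_of[i] != doc_of[j]:
            sim[i, j] = cosine(labels[i], labels[j])

# Workaround (assumption): treat masked pairs as maximally distant.
dist = 1.0 - sim
dist[np.isnan(dist)] = 1.0
np.fill_diagonal(dist, 0.0)

# Complete linkage on the condensed distance matrix, cut just below 1.0.
Z = linkage(squareform(dist), method="complete")
clusters = fcluster(Z, t=0.99, criterion="distance")

for label, cluster_id in zip(labels, clusters):
    print(cluster_id, label)
```

On the example labels this pairs each Doc1 label with its closest Doc2 label. Note that the fill value also treats same-document pairs exactly like cross-document pairs with zero similarity, which may or may not be what is wanted.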

user2680293