I am clustering documents using the cosine similarity between each pair of documents, and that part works fine. My problem is a little unusual, though: I only want to cluster certain items with certain others, not everything against everything. Here's an example.
I have two spreadsheets (documents) with 3 labels apiece. I want to cluster labels that are similar to each other BETWEEN the documents, but never two labels from within the same document. For instance:
Doc1 has labels: sex and gender, tobacco use years, current age
Doc2 has labels: gender, age now, time of use
To enforce this, I've created a similarity matrix that looks like this:
          d1_l1     d1_l2 d1_l3     d2_l1 d2_l2     d2_l3
d1_l1 1.0000000        NA    NA 0.5773503   0.0 0.0000000
d1_l2        NA 1.0000000    NA 0.0000000   0.0 0.3333333
d1_l3        NA        NA   1.0 0.0000000   0.5 0.0000000
d2_l1 0.5773503 0.0000000   0.0 1.0000000    NA        NA
d2_l2 0.0000000 0.0000000   0.5        NA   1.0        NA
d2_l3 0.0000000 0.3333333   0.0        NA    NA 1.0000000
where the cosine similarity between two labels in the same document is set to NA. The problem is that agnes and other hierarchical clustering methods don't accept NA values. What should I do instead? Am I thinking about this the wrong way?
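
For reference, here is a minimal Python sketch of how a matrix like the one above can be built. It assumes each label is split on whitespace and treated as a binary bag of words, which happens to reproduce the values shown. The names `docs`, `cosine`, and `cross_doc_similarity` are hypothetical and only for illustration, and NaN stands in for NA.

```python
import math

# Labels copied from the example above; "d1" and "d2" are just keys.
docs = {
    "d1": ["sex and gender", "tobacco use years", "current age"],
    "d2": ["gender", "age now", "time of use"],
}


def cosine(a, b):
    """Cosine similarity of two labels as binary bag-of-words vectors.

    Assumption: tokens are whitespace-separated words, each counted once.
    """
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def cross_doc_similarity(docs):
    """Return (names, matrix), with NaN for pairs inside the same document."""
    items = [
        (f"{doc}_l{i}", doc, label)
        for doc, labels in docs.items()
        for i, label in enumerate(labels, start=1)
    ]
    names = [name for name, _, _ in items]
    matrix = []
    for name_a, doc_a, label_a in items:
        row = []
        for name_b, doc_b, label_b in items:
            if name_a == name_b:
                row.append(1.0)        # a label compared with itself
            elif doc_a == doc_b:
                row.append(math.nan)   # same document: deliberately left undefined
            else:
                row.append(cosine(label_a, label_b))
        matrix.append(row)
    return names, matrix


names, sim = cross_doc_similarity(docs)
for name, row in zip(names, sim):
    print(name, " ".join(f"{v:.7f}" for v in row))
```

The NaN entries are exactly the cells that the clustering functions reject, so this is the input I am asking about rather than a solution.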