
I am using WEKA to work with a text collection. Suppose I have n documents. I computed a TF-IDF feature vector for each document and then calculated the cosine similarity between every pair of documents, which produced an n×n matrix. Now I wonder how to use this n×n matrix in the k-means algorithm. I know I can apply a dimensionality reduction technique such as MDS or PCA. What confuses me is how I will identify the documents themselves after the reduction. For example, with 3 documents d1, d2, d3, cosine gives me the distances d11, d12, d13, d21, d22, d23, d31, d32, d33. I am not sure what the output of PCA or MDS will be, or how I will identify the documents after k-means. Please advise. I hope I have stated my question clearly.

Nhqazi

1 Answer


PCA is used on the raw data, not on distances, i.e. PCA(X).

MDS uses a distance function, i.e. MDS(X, cosine).

You appear to believe you need to run PCA(cosine(X))? That doesn't work.

You want to run MDS(X, cosine).
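To make the MDS(X, cosine) step concrete, here is a minimal sketch in Python with NumPy. It is not WEKA's API. It assumes cosine distance is taken as 1 minus cosine similarity, and it uses classical (Torgerson) MDS, which is only one MDS variant. The TF-IDF values and the names `cosine_distance_matrix`, `classical_mds`, and `doc_ids` are hypothetical, chosen for illustration. The point to notice is that row i of the output is the coordinate of document i, so the row order is what identifies each document.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - sim

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS: turn an n x n distance matrix into n points."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    # Cosine distance is not Euclidean, so small negative eigenvalues can
    # appear; they are clipped to zero here (an assumption of this sketch).
    lam = np.clip(eigvals[idx], 0.0, None)
    return eigvecs[:, idx] * np.sqrt(lam)

# Hypothetical TF-IDF vectors: one row per document, one column per term.
doc_ids = ["d1", "d2", "d3", "d4"]
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.7, 0.6],
    [0.1, 0.0, 0.6, 0.8],
])

D = cosine_distance_matrix(X)           # n x n distance matrix
coords = classical_mds(D, n_components=2)  # n x 2 coordinate matrix

# Row i of coords still belongs to document i.
for doc, point in zip(doc_ids, coords):
    print(doc, np.round(point, 3))
```

The n×n matrix goes in and an n×2 matrix comes out, one row per document in the same order, which is the kind of input k-means expects.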

Has QUIT--Anony-Mousse
  • Thanks for the reply. However, my question is: once you get the result from MDS, how do you use it for clustering? In simple words, I am not clear how to feed the n×n distance matrix calculated from the cosine similarity function into k-means or any other clustering algorithm. What confuses me about using this square matrix of distances is that the dimension has changed from n to n×n. Any suggestions, please? – Nhqazi Jun 22 '16 at 08:59
  • That's solved by MDS. After MDS you have a matrix of coordinates, not a distance matrix. – Has QUIT--Anony-Mousse Jun 22 '16 at 11:13
  • Thanks, but I guess MDS will not give me any clusters. I wish to group the data into clusters. MDS will give x, y coordinates; how do I use them in k-means then? – Nhqazi Jun 22 '16 at 11:24
  • Yes, you can try using k-means on the output of MDS. – Has QUIT--Anony-Mousse Jun 22 '16 at 11:27
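
As a rough illustration of the last comment, the sketch below runs a plain k-means over a small coordinate matrix of the kind MDS would return and maps the cluster labels back to document names. It is written in Python with NumPy rather than WEKA, the coordinates are made up for illustration, and the `kmeans` helper with its farthest-point initialisation is an assumption of this sketch, not a library function.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its points (keep it if it is empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical MDS output: row i holds the (x, y) coordinates of document i.
doc_ids = ["d1", "d2", "d3", "d4"]
coords = np.array([
    [-0.48,  0.01],
    [-0.45, -0.02],
    [ 0.47,  0.03],
    [ 0.46, -0.02],
])

labels, centers = kmeans(coords, k=2)

# labels[i] is the cluster of document i, so the row order identifies documents.
clusters = {}
for doc, label in zip(doc_ids, labels):
    clusters.setdefault(int(label), []).append(doc)

print(clusters)  # prints {0: ['d1', 'd2'], 1: ['d3', 'd4']}
```

Neither MDS nor k-means reorders the rows, so keeping the document names in a list parallel to the coordinate matrix is enough to recover which document landed in which cluster.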