Spectral clustering with a similarity matrix constructed from the Jaccard coefficient

Question

I have a categorical dataset and I am performing spectral clustering on it, but the output is not very good. I use the eigenvectors corresponding to the largest eigenvalues as my centroids for k-means.

Please find below the process I follow:

1. Create a symmetric similarity matrix (m*m) using the Jaccard coefficient.
   For example, for a data set with the two rows
   a,b,c,d
   a,b,x,y
   the similarity matrix I compute would look like:
   |1       0.33|
   |0.33    1   |
2. Compute the first k eigenvectors, corresponding to the k largest eigenvalues, where k is the number of clusters.
3. Normalize the symmetric similarity matrix.
4. Perform the clustering on the normalized similarity matrix, using the eigenvectors as initial centroids for k-means.
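
A minimal sketch of step 1, assuming each row is treated as an unordered set of values (so the column a value appears in is ignored). The helper names `jaccard_similarity` and `similarity_matrix` are illustrative and not taken from the original post:

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty rows count as identical.
        return 1.0
    return len(a & b) / len(union)


def similarity_matrix(rows):
    """Symmetric m x m matrix of pairwise Jaccard coefficients."""
    return [[jaccard_similarity(r, s) for s in rows] for r in rows]


rows = [["a", "b", "c", "d"], ["a", "b", "x", "y"]]
for line in similarity_matrix(rows):
    print([round(v, 2) for v in line])
```

The two example rows share 2 of 6 distinct values, which gives the off-diagonal entry 2/6 ≈ 0.33 shown in the matrix above.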

My questions are :

Is a Jaccard similarity matrix the right choice for spectral clustering?

Is using eigenvectors as cluster centroids the right approach for spectral clustering? I don't see other options for a categorical dataset.

Is there anything wrong with the procedure I follow?

score 1 · Accepted Answer · answered Jun 10 '15 at 20:49

As far as I can tell, you have mixed and shuffled a number of approaches. No wonder it doesn't work...

- You could simply use Jaccard distance (1 minus the Jaccard similarity) with hierarchical clustering.
- You could use MDS to project your data, then run k-means (probably what you are trying to do).
- Affinity propagation and similar methods are worth a try.
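
To illustrate the first option, here is a rough sketch of Jaccard distance combined with a naive average-linkage agglomerative clustering, written in plain Python. The function names, the choice of average linkage, and the toy rows are assumptions made for illustration only; in practice one would more likely use a library routine such as SciPy's hierarchical clustering.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|: 0 for identical sets, 1 for disjoint sets."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise Jaccard distance."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(jaccard_distance(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Pick the closest pair of clusters (x < y by construction).
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy categorical rows, invented for this example.
rows = [
    ["a", "b", "c", "d"],
    ["a", "b", "x", "y"],
    ["a", "b", "c", "y"],
    ["p", "q", "r", "s"],
    ["p", "q", "r", "t"],
]
print(average_linkage(rows, n_clusters=2))
```

On this toy data the sketch prints `[[0, 1, 2], [3, 4]]`: the three rows sharing `a, b` end up in one cluster and the two rows sharing `p, q, r` in the other. No eigenvectors or k-means are involved, since the clustering works directly on the pairwise distances.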

Has QUIT--Anony-Mousse

Thanks for your reply. I am just a beginner in the field of cluster analysis, trying out different approaches. One more question: would creating a similarity matrix (m*m) using the Jaccard coefficient and then performing k-means on that matrix do any good? Is it a viable approach? I tried it on a few data sets from http://archive.ics.uci.edu/ml/datasets.html (congress, mushroom), and it gives promising results. Thanks – Sam Jun 10 '15 at 21:21
k-means should be run on the raw data. It is meant for a linear, Euclidean vector space. **Don't run methods just because you can**. Understand the requirements and objectives of the algorithm *and* your problem. If you can get them to align (which usually needs substantial preprocessing), then give it a try. – Has QUIT--Anony-Mousse Jun 10 '15 at 21:35
