Use a similarity function for clustering scikit-learn

Question

I use a function to calculate the similarity between a pair of documents and want to perform clustering using this similarity measure.

Code so Far

Sim=np.zeros((n, n)) # create a numpy array
i=0
j=0
for i in range(0,n):
    for j in range(i,n):
        if i==j:
            Sim[i][j]=1
        else:
            Sim[i][j]=simfunction(list_doc[i],list_doc[j]) # calculate similarity between documents i and j using simfunction
Sim=Sim+ Sim.T - np.diag(Sim.diagonal()) # complete the symmetric matrix

AggClusterDistObj=AgglomerativeClustering(n_clusters=num_cluster,linkage='average',affinity="precomputed") 
Res_Labels=AggClusterDistObj.fit_predict(Sim)

My concern is that I used a similarity function here, but according to the documentation the input should be a dissimilarity matrix. How can I change it into a dissimilarity matrix? Also, what would be a more efficient way to do this?

score 5 · Accepted Answer · answered Oct 02 '14 at 07:02

Please format your code correctly, as indentation matters in Python.
If possible, keep the code complete (you left out the line import numpy as np).
Since range always starts from zero, you can omit it and write range(n).

Indexing in numpy works like [i, j, k, ...].
So instead of Sim[i][j] you actually want to write Sim[i, j], because otherwise you do two operations: first taking the entire row slice and then indexing the column.

Here's another way to copy the elements of the upper triangle to the lower one:

import numpy as np

Sim = np.identity(n)  # ones on the diagonal (100 percent similarity)

for i in range(n):
    for j in range(i + 1, n):  # +1 skips the diagonal
        Sim[i, j] = simfunction(list_doc[i], list_doc[j])

# Expand the matrix (copy the upper triangle into the lower one)
tril = np.tril_indices_from(Sim, -1)  # lower-triangle indices (without diagonal)
# Read the mirrored values through the transpose: Sim[tril] = Sim[triu] would
# pair the elements in the wrong order once n > 3
Sim[tril] = Sim.T[tril]
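
To check the fill-and-mirror idea on a small input, here is a minimal sketch. The jaccard function and the docs list are hypothetical stand-ins for simfunction and list_doc, and build_similarity_matrix is a helper name made up for this example. It calls the similarity function only for pairs with i < j and then mirrors the values through the transpose.

```python
import numpy as np


def jaccard(a, b):
    """Hypothetical stand-in for simfunction: Jaccard similarity of token sets."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def build_similarity_matrix(docs, sim):
    """Fill the upper triangle with sim(), then mirror it into the lower one."""
    n = len(docs)
    S = np.identity(n)  # a document is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):  # n * (n - 1) / 2 calls to sim()
            S[i, j] = sim(docs[i], docs[j])
    lower = np.tril_indices_from(S, -1)
    S[lower] = S.T[lower]  # S.T[i, j] is S[j, i], i.e. the upper-triangle value
    return S


# Made-up toy documents, only for illustration
docs = ["red apple pie", "green apple pie", "fast red car", "slow green car"]
S = build_similarity_matrix(docs, jaccard)
print(S)
print("symmetric:", np.allclose(S, S.T))
```

The symmetry check at the end is a cheap way to catch index mix-ups before the matrix is handed to a clustering routine.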

Assuming that your similarities really lie within the range 0 to 1, you can convert the similarity matrix into a distance matrix simply with:

dm = 1 - Sim

NumPy vectorizes this operation, so the subtraction is applied to every element without an explicit loop.
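
Putting the conversion together with the clustering call from the question, a rough end-to-end sketch could look like this. The similarity values are made up for illustration. One assumption to be aware of: newer scikit-learn releases name the parameter metric, while older ones (as in the question) use affinity, so the sketch tries metric first and falls back to affinity.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy similarity matrix (made-up values): items 0/1 and items 2/3 are similar
Sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

dm = 1 - Sim  # similarity in [0, 1] becomes a distance with zeros on the diagonal

try:
    # Newer scikit-learn versions: the parameter is called metric
    model = AgglomerativeClustering(
        n_clusters=2, linkage="average", metric="precomputed"
    )
except TypeError:
    # Older scikit-learn versions: the parameter is called affinity
    model = AgglomerativeClustering(
        n_clusters=2, linkage="average", affinity="precomputed"
    )

labels = model.fit_predict(dm)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which items end up sharing a label.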
