I have been trying to calculate the silhouette coefficient for clusters I created with KModes clustering (all of my data fields are categorical). I am using matching dissimilarity as the distance measure.

import numpy as np

def matching_disimilarity(a, b):
    return np.sum(a != b)  # number of fields in which the two rows differ

Since I could not find an existing Python implementation on the internet, I decided to write one myself, following the Wikipedia article: https://en.wikipedia.org/wiki/Silhouette_(clustering). Here's what I have so far.

def silhouette_analysis(df):
    n_clusters = 5
    sil = []

    for i, r_i in df.iterrows():
        c_i = r_i['cluster']
        r_i = r_i.drop('cluster', axis=0)
        same_cluster_df = df[df['cluster'] == c_i].reset_index(drop=True)
        other_clusters_df = df[df['cluster'] != c_i].reset_index(drop=True)

        a_i = 0
        for j, r_j in same_cluster_df.iterrows():
            r_j = r_j.drop('cluster', axis=0)
            d_ij = matching_disimilarity(r_i, r_j)
            a_i += d_ij
        a_i = a_i/(len(same_cluster_df) - 1)

        b_i = []
        for c_n in range(n_clusters):
            if c_i == c_n: continue
            nearest_cluster_df = other_clusters_df[other_clusters_df['cluster'] == c_n]
            b_in = 0  # reset the running total for each cluster
            for j, r_j in nearest_cluster_df.iterrows():
                r_j = r_j.drop('cluster', axis=0)
                d_ij = matching_disimilarity(r_i, r_j)
                b_in += d_ij
            b_in = b_in/len(nearest_cluster_df)
            b_i.append(b_in)
        b_i = min(b_i)

        if (a_i < b_i):
            s_i = 1 - (a_i/b_i)
        elif(a_i == b_i):
            s_i = 0
        else:
            s_i = b_i/a_i - 1

        sil.append(s_i)

    df['sil'] = sil
    return df

The dataframe df that I am passing as the argument has the clusters already mapped to each row in the cluster column.

There are 3 questions I want to ask:

  1. Is my code correct? Will it give me a correct evaluation of my clusters?
  2. It is very slow right now. I have nearly 20k rows, and it takes more than 2 minutes to calculate the silhouette coefficient for a single row. How can I speed it up?
  3. Is there an existing, reliable Python implementation of the silhouette coefficient for KModes clustering that uses matching dissimilarity as the distance measure?
asanoop24
  • You can check with the source code of sklearn's implementation of silhouette, [here](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/metrics/cluster/_unsupervised.py#L38) – null May 12 '20 at 12:05
  • @null I did in fact check the sklearn implementation, but it doesn't let me use matching dissimilarity as a distance measure. Its predefined distance metrics work for numerical data but not for categorical data. – asanoop24 May 12 '20 at 12:22
  • Referring to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), see the parameter "metric". If you pass your distance matrix and pass `metric="precomputed"` then it will treat that as a distance matrix. Moreover, metric can be callable, so you can try to pass your function `matching_disimilarity` in a proper way to parameter "metric". – null May 12 '20 at 13:40
  • @null I don't want to use "precomputed", as calculating the distance matrix with my code takes a lot of time. Let me try the callable approach. – asanoop24 May 12 '20 at 14:16
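
To make the suggestion in the comments concrete: scikit-learn's `silhouette_samples` accepts `metric="hamming"`, and Hamming distance is the matching dissimilarity divided by the number of fields. Because the silhouette value is a ratio of distances, that constant factor cancels, so the result should be the same as with raw mismatch counts. The sketch below assumes the categories are first mapped to integer codes (here with `pd.factorize`); the toy DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_samples

# Toy categorical data with cluster labels already assigned (made up for illustration).
df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "red"],
    "size": ["S", "M", "M", "M", "S"],
    "cluster": [0, 0, 1, 1, 1],
})

feature_cols = ["colour", "size"]
# The Hamming metric needs numeric input, so map each column's categories to integer codes.
codes = np.column_stack([pd.factorize(df[c])[0] for c in feature_cols])

# Hamming distance = fraction of fields that differ = matching dissimilarity / n_fields.
sil = silhouette_samples(codes, df["cluster"].to_numpy(), metric="hamming")
df["sil"] = sil
print(df)
```

If only the overall mean is needed, `silhouette_score` takes the same `metric` argument.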

1 Answer

I know this was asked a long time ago, but I would be curious to know what your final solution looked like, along with its overall accuracy. I'm working through something similar and, after studying the algorithm, I'm in the throes of building my own.

As for speeding things up, it looks like you could precompute certain values. For example, since you already know the cluster labels, couldn't you split the incoming DataFrame into its per-cluster DataFrames once, before iterating over its rows?

Further, couldn't you drop the cluster column once at that same point, rather than on a row-by-row basis? Better still, you could add a parameter that specifies the categorical columns to use, which would remove the need to drop the column at all.
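
Here is a rough sketch of how those two ideas might look together. It is not a tested drop-in replacement. The function name `silhouette_matching` and the parameters `feature_cols` and `cluster_col` are names I made up. It assumes at least two clusters, and it assigns 0 to a point that is alone in its cluster. Rows are split by cluster once, and whole blocks of distances are computed with NumPy broadcasting instead of `iterrows`.

```python
import numpy as np
import pandas as pd


def silhouette_matching(df, feature_cols, cluster_col="cluster"):
    """Per-row silhouette values under matching dissimilarity (illustrative sketch)."""
    # Encode every categorical column as integer codes once, up front.
    X = np.column_stack([pd.factorize(df[c])[0] for c in feature_cols])
    labels = df[cluster_col].to_numpy()
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    # Split the row positions by cluster once, before any looping.
    members = {c: np.flatnonzero(labels == c) for c in clusters}

    sil = np.zeros(len(df))
    for c in clusters:
        rows = X[members[c]]
        mean_dist = {}  # cluster k -> mean distance from each row of c to cluster k
        for k in clusters:
            # Broadcasting gives a (rows in c) x (rows in k) matrix of mismatch counts.
            d = (rows[:, None, :] != X[members[k]][None, :, :]).sum(axis=2)
            if k == c:
                # The self-distance is 0, so divide by (cluster size - 1).
                mean_dist[k] = d.sum(axis=1) / max(len(rows) - 1, 1)
            else:
                mean_dist[k] = d.mean(axis=1)
        a = mean_dist[c]
        b = np.min([mean_dist[k] for k in clusters if k != c], axis=0)
        denom = np.maximum(a, b)
        # s = (b - a) / max(a, b), defined as 0 when both are 0.
        s = np.where(denom > 0, (b - a) / np.where(denom > 0, denom, 1), 0.0)
        if len(rows) == 1:
            s = np.zeros(1)  # convention: a singleton cluster scores 0
        sil[members[c]] = s
    return sil


# Toy usage with invented data.
toy = pd.DataFrame({
    "x": ["a", "a", "b", "b", "a"],
    "y": ["p", "q", "q", "q", "p"],
    "cluster": [0, 0, 1, 1, 1],
})
toy_sil = silhouette_matching(toy, feature_cols=["x", "y"])
print(toy_sil)
```

Each broadcasted block holds (rows in one cluster) x (rows in another) x (number of fields) booleans, so with large clusters it may be necessary to process the rows of a cluster in chunks.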

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 14 '22 at 21:07