
I am working on clustering documents by looking at their structure.

I have extracted the structure into BERT embeddings, stored in the variable X in the code below.

What I am trying:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

scores = []
for num_clusters in np.arange(2, 200):
    model = KMeans(n_clusters=num_clusters)
    model.fit(X)
    pred = model.predict(X)

    cluster_sum = 0
    for i in range(num_clusters):
        # Gather all points assigned to cluster i
        members = X[pred == i]
        # Mean pairwise cosine similarity within the cluster
        # (equivalent to the nested m/n loops, but vectorized)
        cluster_sum += cosine_similarity(members).mean()
    # Average the per-cluster scores and keep them for plotting
    scores.append(cluster_sum / num_clusters)

I wrote this code to find a similarity score for the clustering (averaging the similarity scores of all the clusters). The problem I am facing: as the number of clusters increases, the score keeps increasing.

How can I find the optimum number of clusters? The plot below is for the knee algorithm suggested by @Cyrus in the answers; I am not able to see where I should draw the line.

[Plot: average within-cluster cosine similarity vs. number of clusters]
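If it helps, a minimal sketch of locating the knee programmatically, assuming the third-party kneed package (my assumption; the thread only mentions the knee algorithm) and the scores list collected in the loop above:

from kneed import KneeLocator

ks = list(range(2, 200))
# The score curve increases with k, so search for a concave,
# increasing knee; adjust curve/direction to match your plot
kl = KneeLocator(ks, scores, curve="concave", direction="increasing")
print("Suggested number of clusters:", kl.knee)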

Darth Vader

2 Answers


There are quite a few topics that can point you in the right direction. You can look into a few such as:

  1. Elbow Method
  2. Silhouette Analysis (see the sketch after this list)
  3. Clustering algorithms that do not require the number of clusters upfront (such as DBSCAN)
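A minimal sketch of silhouette analysis with scikit-learn, assuming X holds the BERT embeddings from the question (the k range is illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 30):
    labels = KMeans(n_clusters=k).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter,
    # better-separated clusters. metric="cosine" matches the
    # similarity measure used in the question.
    print(k, silhouette_score(X, labels, metric="cosine"))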

Hope this helps!

Cyrus Dsouza
  • Manish Prasad, details are on Google. Please search "how to find optimal clusters in k-means?" and you will get multiple topics that you can look into. – Cyrus Dsouza May 13 '20 at 13:37

My answer addresses the more mathematical side of your question:

sklearn's KMeans implementation uses Euclidean distance to measure the dissimilarity between data points. However, you are evaluating the clustering quality with cosine similarity, a different measure from the one the clustering result was optimized for. This could explain the increase in cluster score as the number of clusters increases.
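As an aside (a workaround not mentioned in this answer): for L2-normalized vectors, squared Euclidean distance equals 2 - 2*cos(a, b), so running KMeans on normalized embeddings roughly optimizes a cosine criterion. A minimal sketch:

from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so Euclidean
# KMeans on normalized rows behaves like clustering by cosine
# similarity; n_clusters=10 is an arbitrary illustrative choice
X_unit = normalize(X)  # L2-normalize each embedding row
labels = KMeans(n_clusters=10).fit_predict(X_unit)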

Note that KMeans has an inertia_ attribute, which is the sum of squared distances of samples to their closest cluster center; this can be considered a valid cluster score for KMeans under Euclidean distance.
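A minimal sketch of collecting inertia_ for an elbow plot, assuming the same X (the k range is illustrative):

from sklearn.cluster import KMeans

inertias = []
for k in range(2, 50):
    # inertia_ always decreases as k grows, so look for the
    # "elbow" where the curve stops dropping sharply rather
    # than for a minimum
    inertias.append(KMeans(n_clusters=k).fit(X).inertia_)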

I hope this helps!

kampmani
  • Will using the Euclidean distance for creating the score help me in any way? It is BERT embeddings that I am using for clustering. – Darth Vader May 13 '20 at 12:23
  • I tried using the Euclidean distance for computing the score, and the score decreases almost continuously as the number of cluster centers increases. – Darth Vader May 13 '20 at 12:33
  • The `KMeans` implementation you are using can't use anything other than Euclidean distance, and the corresponding score is recorded in `inertia_`. You can use `KMeans` for BERT embeddings, but pay attention to the [Curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) in high-dimensional spaces; Euclidean distance is quite sensitive to it. – kampmani May 13 '20 at 12:41
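A brief illustration of the dimensionality point in the last comment, assuming PCA as the reduction step (my choice; the comment does not name one):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce the high-dimensional BERT embeddings before clustering
# to soften the curse of dimensionality; 50 components and
# n_clusters=10 are arbitrary illustrative choices
X_reduced = PCA(n_components=50).fit_transform(X)
labels = KMeans(n_clusters=10).fit_predict(X_reduced)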