I am working on clustering documents by looking at their structure. I have encoded this structure as BERT embeddings, stored in the variable X used in the code below.
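For context, X is built roughly like the sketch below (the sentence-transformers model name and the sample documents are placeholders, not necessarily what I actually use):

    from sentence_transformers import SentenceTransformer

    # Placeholder BERT-based encoder; my real model/pooling may differ
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["example document one", "example document two"]  # my document texts
    X = encoder.encode(docs)  # numpy array of shape (n_docs, embedding_dim)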
What I am trying:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

for num_clusters in np.arange(2, 200):
    model = KMeans(n_clusters=num_clusters)
    model.fit(X)
    pred = model.predict(X)
    centers = model.cluster_centers_
    cluster_sum = 0
    for i, c in enumerate(centers):
        # collect the embeddings assigned to cluster i
        use = []
        for j, p in enumerate(pred):
            if p == i:
                use.append(X[j])
        # average pairwise cosine similarity within the cluster
        score = 0
        for m in range(len(use)):
            for n in range(len(use)):
                score += cosine_similarity([use[m]], [use[n]])[0, 0]
        score = score / (len(use) * len(use))
        cluster_sum += score
    # mean of the per-cluster average similarities
    cluster_sum = cluster_sum / num_clusters
I wrote this code to compute an overall similarity score for the clustering (averaging the per-cluster similarity scores). The problem I am facing: as the number of clusters increases, the score keeps increasing.
How can I find the optimum number of clusters? This plot is for the Knee algorithm suggested by @Cyrus in the answers, and I am not able to see where I should draw the line.
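For reference, is something like the sketch below the right way to locate the knee programmatically? I have not actually run this; the kneed package and the KneeLocator parameters are my assumption of how to apply the suggestion:

    import numpy as np
    from sklearn.cluster import KMeans
    from kneed import KneeLocator  # assumed package for automatic knee detection

    k_values = np.arange(2, 200)
    inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in k_values]

    # Inertia decreases as k grows, so the curve is convex and decreasing
    knee = KneeLocator(k_values, inertias, curve="convex", direction="decreasing")
    print(knee.knee)  # candidate number of clusters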