
I have finished implementing traditional k-means text clustering. Now I need to revise my program to do "spherical k-means text clustering", but I have not succeeded yet.

I've searched for solutions online but still cannot revise my program successfully. The following resources should be helpful for my project, but I still cannot figure out a way:

  1. https://github.com/jasonlaska/spherecluster
  2. https://github.com/khyatith/Clustering-newsgroup-dataset
  3. Spherical k-means implementation in Python

This is my traditional K-means program:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import joblib  # store model; sklearn.externals.joblib was removed in recent scikit-learn

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tag_document)  # tag_document is a list that contains many strings

true_k = 3  # assume that I want to have 3 clusters
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# store
joblib.dump(model, 'save/cluster.pkl')

# restore
clu2 = joblib.load('save/cluster.pkl')


order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

I expect to cluster text documents with "spherical k-means clustering".

joyce chiu

1 Answer


First, check that cosine distance is a good similarity measure for your texts, i.e. that two similar texts have a small cosine distance between them. If so, you can simply L2-normalize the vectors and cluster them with ordinary k-means: on unit-length vectors, Euclidean k-means is equivalent to spherical k-means.

I did something like this:

from sklearn.cluster import KMeans
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

k = 20
kmeans = KMeans(n_clusters=k, init='random', random_state=0)
normalizer = Normalizer(copy=False)
sphere_kmeans = make_pipeline(normalizer, kmeans)

sphere_kmeans = sphere_kmeans.fit_transform(vectors)  # vectors: your word2vec/tf-idf matrix
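To tie this back to your TF-IDF setup, here is a minimal self-contained sketch of the same normalize-then-cluster approach; the `docs` list is a made-up stand-in for your `tag_document`, and `n_clusters=2` is chosen just to match the toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# Toy stand-in for tag_document: two documents about cats, two about markets.
docs = [
    "the cat sat on the mat",
    "a cat and a dog sat together",
    "stock markets fell sharply today",
    "markets and stocks fell again",
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# L2-normalizing each row makes Euclidean k-means cluster by cosine
# similarity, which is what spherical k-means does.
sphere_kmeans = make_pipeline(
    Normalizer(copy=False),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = sphere_kmeans.fit_predict(X)
print(labels)
```

The two cat documents and the two market documents end up in different clusters because their TF-IDF vectors share no terms across topics but overlap within each topic.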