I recently did some document clustering using LSA followed by KMeans. However, when I try to print the most important words in each cluster, I get very strange results: it prints words that don't even belong to that cluster.

Below is the code and output:

# ------------------- LSA transformation ------------------------

from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=7, n_iter=100)
lsa_matrix = lsa.fit_transform(tv_matrix)  # fits and projects in one step

terms = tv.get_feature_names()

#--------------------- k means to create clusters  -------------------

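import pandas as pd
from sklearn.cluster import KMeans

X = lsa_matrix

km = KMeans(n_clusters=7, random_state=0)
km.fit(X)  # the distance matrix returned by fit_transform was never used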
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
X_df = pd.DataFrame(X)
result = pd.concat([corpus_df, cluster_labels], axis = 1 )

#--------printing common words in each cluster-------

common_words = km.cluster_centers_.argsort()[:,-1:-11:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(terms[word] for word in centroid))

#-------------------------------------------------------

However, the output is as follows:

0 : ability, ability basic, ability built, ability differentiate, ability add, ability control, ability find
1 : ability add, ability, ability differentiate, ability built, ability find, ability control, ability basic
2 : ability differentiate, ability, ability find, ability control, ability basic, ability add, ability built
3 : ability basic, ability, ability built, ability find, ability control, ability differentiate, ability add
4 : ability find, ability, ability basic, ability add, ability control, ability built, ability differentiate
5 : ability built, ability, ability find, ability control, ability differentiate, ability add, ability basic
6 : ability control, ability, ability add, ability basic, ability built, ability differentiate, ability find

The word "ability" isn't even in most of these clusters. Can someone point out what I'm doing wrong?
