
I need to implement scikit-learn's KMeans for clustering text documents. The example code works fine as it is, but takes some 20newsgroups data as input. I want to use the same code for clustering a list of documents as shown below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

What changes do I need to make in the KMeans example code to use this list as input? (Simply setting `dataset = documents` doesn't work.)

Nabila Shahid

1 Answer


This is a simpler example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

Cluster the documents:

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

Print the top terms per cluster:

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
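As a side note (an editorial sketch, not part of the original answer): once the model is fitted, the same vectorizer and model can also assign new, unseen documents to one of the learned clusters via `transform` and `predict`. The example query string below is made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# transform (not fit_transform!) the new text with the *same* fitted
# vectorizer so the feature columns line up, then find its nearest centroid
Y = vectorizer.transform(["graph minors and spanning trees"])
label = model.predict(Y)[0]
print("assigned to cluster", label)
```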

If you want a more visual idea of what this looks like, see this answer.

elyase
  • thank u but it gives me syntax errors in print commands at end ='' and print() ... how do i make it work? :s – Nabila Shahid Jan 11 '15 at 18:13
  • 1
    Oh, that is because I am Python 3, I edited my answer. – elyase Jan 11 '15 at 18:18
  • @elyase: how can this code be altered to get the central sentences per cluster? – Crista23 Jun 28 '15 at 20:16
  • @Crista23, it is not directly possible. First sentences are converted to numeric vectors (Bag of Words representation) and then clustered but this transformation does not preserve the word order (among other issues) so you can't go back from central vector to sentence. You have to get creative to get 'something' back from the centroid. – elyase Jul 01 '15 at 12:51
Not clear how to cluster sentences instead of words in this case. Word clustering works fine in this example, but sentence clustering does not. – Timur Nurlygayanov Jan 23 '19 at 07:56
  • @elyase ,how do i store the results?,mydict={} for k in range(2,10): kmeans = KMeans(n_clusters = k, max_iter = 300).fit(x) labels = kmeans.labels_ label_df = pd.DataFrame(labels.tolist(),columns=['class']) new_df = pd.concat((harsh,label_df),axis = 1) #new_df.to_csv("result{}.csv".format(k)) order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1] for i in range(k): for ind in order_centroids[i,:12]: mydict.update({i:(terms[ind])}) – krits Apr 02 '19 at 09:44
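On the question of recovering a sentence from a centroid: as noted in the comments, the centroid cannot be decoded back into text, but one pragmatic workaround (an editorial sketch, not from the original thread) is to report, for each cluster, the existing document whose TF-IDF vector lies closest to the centroid, using scikit-learn's `pairwise_distances_argmin_min`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# for each centroid, find the index of the closest document vector;
# that document serves as a human-readable stand-in for the centroid
closest, _ = pairwise_distances_argmin_min(model.cluster_centers_, X)
for i, doc_idx in enumerate(closest):
    print("Cluster %d central document: %s" % (i, documents[doc_idx]))
```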
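On storing the results: the snippet in the last comment uses `mydict.update({i: terms[ind]})`, which overwrites each cluster's entry on every iteration, so only the last term survives. A sketch (editorial, not from the thread) that keeps a list of top terms per cluster and the per-document labels in a DataFrame:

```python
from collections import defaultdict

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# per-document cluster labels, ready to join back onto the source data
labels_df = pd.DataFrame({"document": documents, "cluster": model.labels_})

# append to a list per cluster instead of overwriting the dict entry
terms = vectorizer.get_feature_names_out()  # get_feature_names() on old scikit-learn
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
top_terms = defaultdict(list)
for i in range(true_k):
    for ind in order_centroids[i, :10]:
        top_terms[i].append(terms[ind])
print(dict(top_terms))
```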