Repetition of raw dataset after clustering

Question

import pandas as pd  # needed for pd.DataFrame further down
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.08, max_features=200,
                                 min_df=0.02, stop_words='english',
                                 use_idf=True, ngram_range=(1, 3),
                                 tokenizer=tokenize_only_subject,
                                 analyzer='word')

# fit_transform() fits the vectorizer and returns the matrix, so a separate fit() call is not needed
tfidf_matrix_subject = tfidf_vectorizer.fit_transform(enron_data["headers.Subject"])


print "\n\nshape of tfidf :\t",(tfidf_matrix_subject.shape)

terms_subject = tfidf_vectorizer.get_feature_names()
print "\nFeatures selected by TF-IDF for Subject:\t", terms_subject

x =tfidf_matrix_subject.toarray()
#
#######################################################################################
from sklearn.metrics.pairwise import cosine_similarity
distance = 1 - cosine_similarity(tfidf_matrix_subject)

print "+++distance\t:",distance[:5]


from sklearn.cluster import KMeans

num_clusters = 4

km = KMeans(n_clusters=num_clusters)


print ":",km.fit_transform(tfidf_matrix_subject).shape


centroids = km.cluster_centers_
labels = km.labels_

print "Centroid is:\t",centroids
print "Labels is :\t",labels
n_clusters_ = km.labels_
print "++++++++++++++++++++++++++++++++++++++++++++++\n",n_clusters_
enron_cls = { 'enron_data_body': enron_data["body"],'enron_data_Subject': enron_data["headers.Subject"],'_id_':enron_data["_id"],"Date":enron_data["Date"],'cluster_': n_clusters_}

frame = pd.DataFrame(enron_cls, index = [n_clusters_] , columns = ['_id_','enron_data_body','enron_data_Subject','Date','cluster_'])

print frame.head()
frame.to_csv("errror.csv")

I need help with this clustering. The resulting DataFrame contains repeated values: for example, the fourth row of the raw dataset is repeated again and again, as many times as there are rows in the dataset, next to the cluster labels. I want every row of the raw dataset to appear once with its own cluster label, not repetitions of the same raw rows.

I don't really understand your question, especially the repetition part. Can you elaborate? — patrick, Jun 19 '16 at 14:03
Thanks Brian, it would be a great honor to get your guidance and help. — Jeet Dadhich, Jun 20 '16 at 08:25
I am looking for this output after k-means clustering over the pandas DataFrame:

row  text                cluster
1    jd all well         0
2    come here           0
3    going to pub        0
4    working on data     1
5    show your work      1
6    very bad sound      0
7    Nice time with you  0
8    all is well         0
9    great work done     1
10   awesome pitcure     0

But I am getting this output, and I don't know where my logic goes wrong:

row  text                cluster
1    jd all well         0
2    come here           1
3    jd all well         0
4    come here           1
5    jd all well         0
6    come here           1
7    jd all well         0
8    come here           1
9    jd all well         0
10   come here           1

This is the kind of repetition I am getting. — Jeet Dadhich, Jun 20 '16 at 08:34
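
A possible explanation, offered as a sketch under assumptions rather than a confirmed diagnosis: the question's code passes the cluster labels as the `index` argument of `pd.DataFrame`. When the dict values are pandas Series, pandas aligns them to that index by label, so every row labelled 0 picks up raw row 0, every row labelled 1 picks up raw row 1, and so on, which would produce this kind of repetition. The example below reproduces the effect on a toy frame and shows one way around it, by adding the labels as an ordinary column. The toy data, the hard-coded `labels` array (standing in for `km.labels_`), and the names `raw`, `repeated`, and `fixed` are illustrative and not from the original code; it also passes `index=labels` directly rather than `index=[n_clusters_]`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (texts borrowed from the comment above).
raw = pd.DataFrame({"text": ["jd all well", "come here", "going to pub",
                             "working on data", "show your work",
                             "very bad sound"]})

# Hypothetical cluster labels, standing in for km.labels_.
labels = np.array([0, 0, 0, 1, 1, 0])

# Problem: with index=labels, the Series is re-aligned by label, so each
# output row shows the raw row whose index equals its cluster label.
repeated = pd.DataFrame({"text": raw["text"], "cluster": labels},
                        index=labels)
print(repeated)

# One fix: keep the original rows and add the labels as a plain column.
fixed = raw.assign(cluster=labels)
print(fixed)
```

Applied to the code in the question, this would mean dropping the `index = [n_clusters_]` argument, or building the frame from `enron_data` itself and assigning `km.labels_` as a new column.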
