
I am new to both python and scikit-learn so please bear with me.

I took the source code for the k-means clustering algorithm from k means clustering.

I then modified it to run on my local data set by using the load_files function.

Although the algorithm terminates, it does not produce any output showing which documents are clustered together.

I found that the km object has a "km.labels_" array which lists the centroid id of each document.

It also has the centroid vectors in "km.cluster_centers_".

But which document is it? I have to map it back to "dataset", which is a "Bunch" object.

If I print dataset.data[0], I get the data of the first file, which I think is shuffled. But I just want to know its name.

I am confused by questions like: is the document at dataset.data[0] clustered to the centroid at km.labels_[0]?

My basic problem is to find which files are clustered together. How do I find that?

Ashish Negi
  • Make sure to validate that the results are sensible. K-means will often return results that may be mathematical optima, but are not at all useful for the actual problem at hand! – Has QUIT--Anony-Mousse Jul 22 '13 at 19:55

2 Answers


Forget about the Bunch object. It's just an implementation detail to load the toy datasets that are bundled with scikit-learn.

In real life, with your real data you just have to call directly:

km = KMeans(n_clusters).fit(my_document_features)

then collect cluster assignments from:

km.labels_

my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).

km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element in labels_ is the index of the cluster of the document described in the first row of the my_document_features feature matrix.
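This row-to-label correspondence can be checked on a tiny made-up feature matrix (the vectors below are invented for illustration, not real document features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix: 6 "documents", 2 features each (made-up data)
my_document_features = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],    # three similar documents
    [5.0, 5.1], [5.1, 4.9], [4.95, 5.05],    # three very different ones
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# One label per row: labels_[i] is the cluster of the document in row i
print(km.labels_.shape)                 # (6,)
print(km.labels_[0] == km.labels_[1])   # True: rows 0 and 1 land in the same cluster
```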

Typically you would build my_document_features with a TfidfVectorizer object:

my_document_features = TfidfVectorizer().fit_transform(my_text_documents)

and my_text_documents would be either a list of Python unicode objects, if you read the documents directly (e.g. from a database, rows of a single CSV file, or whatever you want), or alternatively:

vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)

where my_text_files is a Python list of the paths of your document files on your hard drive (assuming they are encoded using UTF-8).

The length of the my_text_files or my_text_documents list should be n_documents, hence the mapping to km.labels_ is direct.
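For example, because the ordering is preserved, you can pair each path with its label via zip and group them (the paths and labels below are hypothetical stand-ins for my_text_files and km.labels_):

```python
from collections import defaultdict

# Hypothetical stand-ins for my_text_files and km.labels_
my_text_files = ["a.txt", "b.txt", "c.txt", "d.txt"]
labels = [0, 1, 0, 1]   # would be km.labels_ in real code

# Group file paths by their cluster label
clusters = defaultdict(list)
for path, label in zip(my_text_files, labels):
    clusters[label].append(path)

print(dict(clusters))  # {0: ['a.txt', 'c.txt'], 1: ['b.txt', 'd.txt']}
```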

As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see n_samples instead of n_documents used to document the expected shapes of the arguments and attributes of all the estimators in the library.
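Putting the pieces together, a sketch of the full pipeline on in-memory strings might look like this (the four documents are invented for illustration, and I don't assert which topics end up together, since TF-IDF on such tiny texts is noisy):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up documents, purely for illustration
my_text_documents = [
    "the cat sat on the mat",
    "cats and kittens purr",
    "stock markets fell sharply",
    "investors sold shares as markets dropped",
]

# Vectorize, then cluster: rows of the feature matrix stay in list order
my_document_features = TfidfVectorizer().fit_transform(my_text_documents)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# labels_[i] belongs to my_text_documents[i]
for doc, label in zip(my_text_documents, km.labels_):
    print(label, doc)
```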

ogrisel
  • Thanks ogrisel. In the sample code dataset.data is your "my_text_files". So how do I find which file is at index 0? If it does not store it, how can I get the file names while using the load_files() function? – Ashish Negi Jul 23 '13 at 04:16
  • 2
    No: `dataset.data` is a list of in-memory python unicode string with the content of the text files. If you use the `load_files()` utility the `filenames` are stored in the `dataset.filenames` list. – ogrisel Jul 23 '13 at 09:17

dataset.filenames is the key :)

This is how I did it.

The load_files declaration is:

def load_files(container_path, description=None, categories=None,
               load_content=True, shuffle=True, charset=None,
               charset_error='strict', random_state=0)

so do

dataset_files = load_files("path_to_directory_containing_category_folders")

then, once I got the result, I put the filenames into clusters, which is a dictionary:

from collections import defaultdict

clusters = defaultdict(list)
for k, label in enumerate(km.labels_):
    clusters[label].append(dataset_files.filenames[k])

and then I print it :)

for clust in clusters:
    print("\n************************\n")
    for filename in clusters[clust]:
        print(filename)
Ashish Negi
  • This helped me figure out how to display which observations belonged to each cluster for a non-file dataset. Thanks! – Jo Douglass Dec 02 '14 at 14:43