
Question

I implemented a K-means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to extract k centroids from the data.

How can I use those centroids to understand the "features" learnt? Are the centroids already the features (it doesn't seem that way to me), or do I need to combine them with the input data again?

In response to some answers: K-means is not "just" a method for clustering; it is a vector quantization method. That said, the goal of k-means is to describe a dataset with a reduced number of feature vectors, so there are strong analogies to methods like sparse filtering/learning regarding the potential outcome.

Code Example

# Perform k-means on the pre-processed data (assuming SciPy's kmeans/vq here;
# scipy.cluster.vq.kmeans returns the codebook and a distortion value)
from scipy.cluster.vq import kmeans, vq

centroids, _ = kmeans(matrix_pca_whitened, 1000)

# Assign each data vector to its nearest centroid (quantizing the same
# whitened matrix the codebook was built on)
idx, _ = vq(matrix_pca_whitened, centroids)
Jamona

3 Answers


The clusters produced by the k-means algorithm separate your input space into k regions. When you have new data, you can tell which region it belongs to, and thus classify it.

The centroids are just a property of these clusters.

You can have a look at the scikit-learn documentation if you are unsure, and at the algorithm cheat-sheet map to make sure you choose the right algorithm.
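
For illustration, here is a minimal sketch of that region assignment, assuming scikit-learn's KMeans (the data and variable names are made up):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))   # stand-in for your training data
X_new = rng.normal(size=(5, 2))       # stand-in for unseen data

# Fit k regions, then "classify" new points by their nearest centroid
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
regions = km.predict(X_new)           # region index per new point
print(km.cluster_centers_)            # the centroids, one per region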

toine
  • Thanks for the answer. I know that k-means separates my input data into k regions. The question nevertheless was how to use the centroids to understand the features learnt. So a centroid can be much more than just the "property" of a cluster from a feature-learning perspective. – Jamona Oct 22 '15 at 11:35

This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups of physical proximity. It says, "there are clumps of stuff in these k places, and here's how all the points map to the nearest one."

What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
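
As a small sketch of that sensitivity (again assuming scikit-learn's KMeans; all data and names here are illustrative), you can compare the within-cluster sum of squares (inertia) across different random initializations:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))  # stand-in data

# With a single initialization per run, different seeds can land in
# different local optima, visible as different inertia values
for seed in range(3):
    km = KMeans(n_clusters=8, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)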

A centroid is basically the "mean" of its cluster. If you can derive some deeper understanding from the distribution of centroids, great, but that depends on the data and features rather than on any significant meaning coming from k-means itself.

Is that the level of answer you need?

Prune
  • No, sorry. The level is not deep enough and is too vague for me. K-means makes sense in terms of feature learning according to lots of literature: "The above discussion has provided the basic ingredients needed to turn K-means into a simple feature learning method." (http://www.cs.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf) Let's not discuss the significance of the learned features, since that's something completely different. – Jamona Oct 23 '15 at 11:00

The centroids are in fact the features learnt. Since k-means is a vector quantization method, we look up which cluster each observation belongs to; that observation is then best described by the corresponding feature vector (centroid).

If one observation was, for example, split into 10 patches beforehand, that observation can consist of at most 10 feature vectors.

Example:

Method: K-means with k=10

Dataset: 20 observations divided into 2 patches each = 40 data vectors

We now perform k-means on this patched dataset and get the nearest centroid per patch. We could then create a vector of length 10 (= k) for each of the 20 observations; if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9, the vector would look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.

This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment, as sketched below.
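
A minimal sketch of this encoding, assuming SciPy's kmeans/vq and made-up patch data (shapes follow the example above: 20 observations × 2 patches each, k = 10):

import numpy as np
from scipy.cluster.vq import kmeans, vq

k = 10
patches = np.random.randn(40, 16)   # 20 observations x 2 patches each
centroids, _ = kmeans(patches, k)
idx, _ = vq(patches, centroids)     # nearest centroid per patch

# Hard assignment: one indicator vector of length k per observation
features = np.zeros((20, k))
for obs in range(20):
    for c in idx[obs * 2:(obs + 1) * 2]:
        features[obs, c] = 1        # e.g. 0-0-0-0-1-0-0-0-1-0

A soft alternative would store the distances to all k centroids per patch instead of the 0/1 indicator.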

Jamona