
From a dataset on which I am using PCA and k-means, I would like to know what the central objects in each cluster are.

What is the best way to describe these central objects in terms of the original iris samples in my dataset?

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3).fit(X_pca)


# I can get the central object from the reduced data but this does not help me describe 
# the properties of the center of each cluster
from sklearn.metrics import pairwise_distances_argmin_min
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X_pca)
for i in closest:
    print(X_pca[i])
Michael

2 Answers


There are two ways to do what you ask.

You can get the nearest approximation of the centers in the original feature space using PCA's inverse transform:

centers = pca.inverse_transform(kmeans.cluster_centers_)
print(centers)

[[ 6.82271303  3.13575974  5.47894833  1.91897312]
 [ 5.80425955  2.67855286  4.4229187   1.47741067]
 [ 5.03012829  3.42665848  1.46277424  0.23661913]]

Or, you can recompute the mean in the original space using the original data and the cluster labels:

for label in range(kmeans.n_clusters):
    print(X[kmeans.labels_ == label].mean(0))

[ 6.8372093   3.12093023  5.4627907   1.93953488]
[ 5.80517241  2.67758621  4.43103448  1.45689655]
[ 5.01632653  3.44081633  1.46734694  0.24285714]

Even though the resulting centers are not in the original dataset, you can treat them as if they are! For example, if you're clustering images, the resulting centers can be viewed as images to get insight into the clustering. Alternatively, you can do a nearest-neighbor search on these results to recover the original data point that most closely approximates the center.
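For example, here is a minimal sketch of that nearest-neighbor lookup, reusing the pca and kmeans objects from the question (the variable names below are only illustrative):

from sklearn.metrics import pairwise_distances_argmin_min

# Map the k-means centers back to the original 4-D feature space,
# then find the real iris sample closest to each mapped center.
centers_original = pca.inverse_transform(kmeans.cluster_centers_)
closest_idx, _ = pairwise_distances_argmin_min(centers_original, X)
for label, idx in enumerate(closest_idx):
    print(label, X[idx])  # an actual row of the original dataset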

Keep in mind, though, that PCA is lossy and KMeans is fast, and so it's probably going to be more useful to run KMeans on the full, unprojected data:

print(KMeans(3).fit(X).cluster_centers_)

[[ 6.85        3.07368421  5.74210526  2.07105263]
 [ 5.9016129   2.7483871   4.39354839  1.43387097]
 [ 5.006       3.418       1.464       0.244     ]]

In this simple case, all three methods produce very similar results.

jakevdp

I'm sorry if this is not exactly the answer, but why are you using PCA at all? You are reducing the data from four dimensions to two, which is a one-way operation: you won't get all four parameters back from two, and you may also slightly distort distance estimates (and therefore the clustering). On the other hand, if you run k-means on the raw data, the cluster centers will be described by the same list of properties as the original items.

Synedraacus
  • I agree; however, this is just a toy example to express the idea of what I'm trying to do. In reality I have a larger dataset with many features that I reduced with PCA. – Michael Nov 24 '15 at 08:47
  • PCA, and, generally speaking, any other dimensionality reduction method, is [not lossless](http://arxiv.org/pdf/1204.0429.pdf). It means that if you want to get the original dimensions back, put the original dimensions in. – Synedraacus Nov 24 '15 at 08:52
  • Sorry, didn't finish editing in time, so a separate comment: for instance, take the cluster returned by k-means and [recompute the center point](http://stackoverflow.com/questions/1253801/finding-the-center-of-a-cluster) for it in the original dimensionality. – Synedraacus Nov 24 '15 at 09:01
  • If you first apply PCA, as @Synedraacus points out, it is not lossless. Thus, the dimensions that are dropped by PCA cannot be "reconstructed" afterwards. What you essentially achieve by applying PCA first and then k-means is clustering on a smaller set of dimensions. There is no point in "comparing" these centroids with the original ones, because you cannot compare things that have a different number of dimensions. That is only the theoretical part; in practice, PCA is applied in many situations as a first "clean-up" step before applying the desired algorithm. – rpd Nov 24 '15 at 10:18
  • Thanks @rpd, this is helpful! So in practice, if I want insight into my clusters (what characterises each cluster), I should just run k-means and interpret the centroids against the original dataset; but for production use / better efficiency, I should apply PCA to reduce the number of dimensions and then run k-means. Basically my clusters will be the same as long as the PCA does not lose too much information. Am I right? – Michael Nov 24 '15 at 10:50
  • Exactly! Note that if you have many features (on the order of hundreds), applying PCA does not only reduce dimensionality for efficiency; it also makes the results easier to interpret, because checking the relevance of 300 features is a difficult, if not impossible, task, and in most cases only a small percentage of them can give you excellent accuracy. – rpd Nov 24 '15 at 11:07
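Putting the comments above together, here is a minimal sketch of the workflow being discussed (PCA for efficiency, k-means on the reduced data, cluster descriptions recovered in the original feature space); the variable names are only illustrative:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data

# Reduce dimensionality first (most useful when there are hundreds of features).
X_reduced = PCA(n_components=2, whiten=True).fit_transform(X)

# Cluster in the reduced space.
labels = KMeans(n_clusters=3).fit_predict(X_reduced)

# Describe each cluster by the mean of its members in the ORIGINAL feature space.
for label in range(3):
    print(label, X[labels == label].mean(axis=0))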