After clustering a dataset with sklearn.cluster.KMeans and then transforming the data to distances from the centroids (the transform method), is it possible to reverse the transformation, given the centroids, and get the original features back?
1 Answer
No, it's not possible in general. Dimensionality reduction is a lossy operation: once you discard some dimensions, there is no way to get that information back. "In general" here means "for every possible data set". A particular data set may contain redundant information, and if a dimensionality reduction technique manages to exploit exactly that redundancy, then a perfect inverse transformation is possible for that data set.
The picture below shows a simple example. Many different configurations of points in 3D space project to the same configuration of points in 2D space, so given only the 2D points there is no way to tell which 3D configuration they came from: the z coordinates are unknown and there are infinitely many possibilities.
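As a minimal sketch of the same argument in code (the arrays are made-up data, chosen only to make the point):

```python
import numpy as np

# Two different 3D point sets that agree on x and y but differ in z.
a = np.array([[1.0, 2.0, 0.0],
              [3.0, 4.0, 5.0]])
b = np.array([[1.0, 2.0, 9.0],
              [3.0, 4.0, -1.0]])

# Projecting onto the xy-plane (discarding z) maps both to the same
# 2D configuration, so the projection cannot be inverted.
print(np.array_equal(a[:, :2], b[:, :2]))  # True
```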

- Nice drawing! I think it may not be as bad for KMeans, since the transformation outputs distances, so the reconstruction must lie on an intersection of (compact) spheres. The distances therefore constrain the point fairly well: two points with the same distance `R` to a centroid cannot lie infinitely far apart; their maximal distance is `2R`. – eickenberg Jun 17 '14 at 23:10
- @eickenberg Yes, it's definitely not as bad as simply forgetting some coordinates, but it is still a loss of information. And you are right that in some circumstances (with a large number of centroids, I'd guess) you may be able to reverse the transformation approximately. – BartoszKP Jun 17 '14 at 23:52
- Possibly, but I am not sure. I don't know the original poster's intentions, but some `sklearn` estimators are equipped with an `inverse_transform` method that gives the most reasonable approximation of an inverse, given the information loss the method incurs. It doesn't seem to exist for `KMeans`, but it could, e.g., output the closest cluster center for each sample. – eickenberg Jun 18 '14 at 07:17
- The closest cluster centers are already given by `km.cluster_centers_[km.predict(X)]` (sketched in code after this thread). I wouldn't call that a good `inverse_transform`, because it ignores all but one dimension of its input. – Fred Foo Jun 18 '14 at 09:44
- It wouldn't be great, but e.g. `WardAgglomeration` does exactly this (fills each cluster with the cluster mean), and that amounts to the exact pseudoinverse of its transformation. With KMeans the transform is evidently non-linear, so it is stranger. Taking another look at this answer, I disagree with the statement that any dimensionality reduction technique is necessarily a lossy operation: if a series of megapixel images has two degrees of freedom, then its intrinsic dimension is 2, and certain dimensionality reduction techniques will permit perfect recovery. – eickenberg Jun 18 '14 at 21:40
- @eickenberg You are right; I missed the fact that there might be redundancy in the data (but that's why I wrote "in general", i.e. for any possible data set). I'll fix that. – BartoszKP Jun 18 '14 at 21:46
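To make the thread's two ideas concrete, here is a hedged sketch. KMeans has no `inverse_transform` in sklearn; the `reconstruct` helper below is purely illustrative, and the `make_blobs` parameters are made up. It shows the one-line "closest centroid" pseudo-inverse from the comments, and an approximate inversion that recovers each point from its distances to the centroids by least squares on the sphere-intersection constraints:

```python
import numpy as np
from scipy.optimize import least_squares
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data; the blob parameters are made up for illustration.
X, _ = make_blobs(n_samples=200, n_features=2, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Idea 1 (from the comments): the "closest centroid" pseudo-inverse.
X_closest = km.cluster_centers_[km.predict(X)]

# Idea 2: recover each point from its transform output (distances to
# all centroids) via least squares on the sphere constraints.
D = km.transform(X)  # shape (n_samples, n_clusters)

def reconstruct(d, centers):
    """Find x minimizing sum_i (||x - c_i|| - d_i)^2 (illustrative helper)."""
    residuals = lambda x: np.linalg.norm(centers - x, axis=1) - d
    x0 = centers[np.argmin(d)]  # start from the closest centroid
    return least_squares(residuals, x0).x

X_rec = np.array([reconstruct(d, km.cluster_centers_) for d in D])
print(np.abs(X_rec - X).max())  # near zero: 5 distances over-determine a 2D point
```

When the number of centroids is at least `n_features + 1` and they are in general position, the distances pin each point down essentially exactly (trilateration), which matches the "large number of centroids" guess above; with fewer centroids the solution is no longer unique, and the reconstruction can only land somewhere on the sphere intersection described in the first comment.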