How to revert One-Hot Enoding in Spark (Scala)

Question

After running k-means (mllib spark scala) I want to make sense of the cluster centers I obtained from data which I pre-processed using (among other transformers) mllib's OneHotEncoder.

A center looks like this:

Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]

Which is obviously not very human friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features? What if I look for the data point which is closest (using the same distance metric that is used by k-means, which I assume is Euclidean distance) to the centroid and then revert the encoding of that particular data point?

Steffen Schmitz · Accepted Answer · 2017-05-25T10:02:15.750

1

For the cluster centroids it is not possible (strongly disrecommended) to reverse the encoding. Imagine you have the original feature "3" out of 6 and it is encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it's easy to extract 3 as the correct feature from the encoding.

But after kmeans application you may get a cluster centroid that looks for this feature like this [0.0,0.13,0.0,0.77,0.1,0.0]. If you want to decode this back to the representation that you had before, like "4" out of 6, because the feature 4 has the largest value, then you will lose information and the model may get corrupted.

Edit: Add a possible way to revert encoding on datapoints from the comments to the answer

If you have IDs on the datapoints you can perform a select / join operation on the ID after you assigned a datapoints to a cluster to get the old state, before the encoding.

edited May 25 '17 at 10:02

answered May 24 '17 at 14:01

Steffen Schmitz

860
3
16
34

Thanks! I understand your answer. What if I look for the data point which is closest (using the same distance metric that is used by k-means, which I assume is Euclidean distance) to the centroid and then revert the encoding of that particular data point? – João Moura May 24 '17 at 14:17
1

@JoãoMoura Then I think the easiest thing would be to have ID's on each datapoint and you retrieve the original value by ID after you assigned a point to its cluster. Then you do not need to revert the encoding, but perform a simple select / join operation on the original and the encoded dataset. – Steffen Schmitz May 24 '17 at 14:35

How to revert One-Hot Enoding in Spark (Scala)

1 Answers1