I have a huge data file consisting entirely of categorical columns. I need to dummy code the data before applying k-means in MLlib. How can this be done in PySpark?

Thank you

zero323
SparkiTony
  • How many categories? Many or just a few? – David Maust Jan 10 '16 at 00:41
  • Not really an answer, but this post on [the Data Science SE](http://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data) goes into much more detail about why categorical data and k-means are less than a perfect match. Definitely worth the read! – Nelewout Jan 10 '16 at 23:50
  • Could you either accept the answer or explain why it doesn't work for you so it can be improved? Thanks in advance. – zero323 Apr 25 '16 at 10:06

1 Answer

Well, technically it is possible. Spark, including PySpark, provides a number of transformers which can be used to encode categorical data. In particular, you should take a look at ml.feature.StringIndexer and ml.feature.OneHotEncoder.

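from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["label", "feature"])

# Map each distinct string value to a numeric index.
stringIndexer = StringIndexer(inputCol="feature", outputCol="indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)

# One-hot encode the index. In Spark 3.0+ OneHotEncoder is an estimator and must be
# fit first; in Spark 1.x/2.x call encoder.transform(indexed) directly instead.
encoder = OneHotEncoder(inputCol="indexed", outputCol="encoded")
encoded = encoder.fit(indexed).transform(indexed)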

So far so good. The problem is that categorical variables are not very useful for k-means. The algorithm assumes Euclidean distance, which, even after encoding, is rather meaningless for categorical data.
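To make the point about distance concrete, here is a minimal plain-Python sketch (no Spark required). The helpers `one_hot` and `euclidean` are hypothetical names introduced only for this illustration, and the sketch assumes a full one-hot encoding with one slot per level. It shows that every pair of distinct categories ends up exactly the same distance apart, so the distance carries no information about which categories are "closer".

```python
import math
from itertools import combinations

def one_hot(categories):
    """Map each distinct category to a one-hot vector (hypothetical helper)."""
    levels = sorted(set(categories))
    return {
        level: [1.0 if i == j else 0.0 for j in range(len(levels))]
        for i, level in enumerate(levels)
    }

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

encoding = one_hot(["foo", "bar", "baz", "foo"])

# Distance between every pair of distinct categories.
distances = {
    (a, b): euclidean(encoding[a], encoding[b])
    for a, b in combinations(encoding, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 4))  # each pair is sqrt(2) apart
```

Note that Spark's OneHotEncoder drops the last category by default (dropLast=True), so there the all-zeros category sits at distance 1 from the others rather than sqrt(2). The conclusion is the same: the distances reflect the encoding, not any real similarity between categories.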

zero323
  • Just a technical note: one-hot encoding will blow up the dimensionality of the problem extremely quickly (especially since OP indicated that he has _a huge data file consisting of all categorical columns_), which could create a myriad of problems outlined [here](https://en.wikipedia.org/wiki/Clustering_high-dimensional_data). – Nelewout Jan 10 '16 at 23:56
  • @N.Wouda I agree, although I think it is a secondary issue compared to using a model with completely wrong assumptions. Still, feel free to edit the answer. I can convert it into community wiki if you prefer. – zero323 Jan 11 '16 at 00:22
  • I did in no way mean to discredit your answer! I was merely remarking that that would be an additional consideration the OP should take into account when taking this approach. You're most welcome to include my comment in your answer, if you feel like doing so :). – Nelewout Jan 11 '16 at 00:24
  • I know k-means is not a good approach, but for now we don't have another option because of the nature of the project. So if I use what zero323 has suggested, how do I apply it to k-means? – SparkiTony Feb 01 '16 at 20:22