There is a huge data file consisting entirely of categorical columns. I need to dummy-code the data before applying k-means in MLlib. How can this be done in PySpark?
Thank you
Well, technically it is possible. Spark, including PySpark, provides a number of feature transformers which can be used to encode categorical data. In particular you should take a look at ml.feature.StringIndexer and ml.feature.OneHotEncoder:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["label", "feature"])

# Map each distinct string value to a numeric index
stringIndexer = StringIndexer(inputCol="feature", outputCol="indexed")
indexed = stringIndexer.fit(df).transform(df)

# Expand the indices into sparse dummy (one-hot) vectors;
# note that in Spark >= 3.0 OneHotEncoder is an estimator and must be fit first,
# while in older versions you can call encoder.transform(indexed) directly
encoder = OneHotEncoder(inputCol="indexed", outputCol="encoded")
encoded = encoder.fit(indexed).transform(indexed)
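If you want to take this all the way to clustering, here is a minimal sketch of feeding the encoded column into k-means. It assumes Spark >= 3.0 and uses the DataFrame-based ml API rather than the RDD-based mllib one (the latter has been in maintenance mode since Spark 2.0); the column names simply continue the example above, and k=2 is arbitrary:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Collect the encoded dummy vectors into a single "features" column
assembler = VectorAssembler(inputCols=["encoded"], outputCol="features")
assembled = assembler.transform(encoded)

# Fit k-means on the assembled features and attach a cluster id to each row
kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")
clusters = kmeans.fit(assembled).transform(assembled)
clusters.show()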
So far so good. The problem is that categorical variables are not very useful for k-means. It assumes a Euclidean distance which, even after encoding, is rather meaningless for categorical data.
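To make the point concrete, here is a toy illustration in plain NumPy (the vectors stand in for dropLast=False style one-hot encodings of three categories, purely for this example): every pair of distinct categories ends up at exactly the same Euclidean distance, so the geometry says nothing about which categories are "closer" to each other.

import numpy as np

# One-hot encodings of three distinct categories
foo = np.array([1.0, 0.0, 0.0])
bar = np.array([0.0, 1.0, 0.0])
baz = np.array([0.0, 0.0, 1.0])

# All pairwise distances are sqrt(2); the encoding only tells k-means
# whether two categories are the same or different, nothing more
print(np.linalg.norm(foo - bar))  # 1.414...
print(np.linalg.norm(foo - baz))  # 1.414...
print(np.linalg.norm(bar - baz))  # 1.414...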