One-hot encoding with different unique values in PySpark

Asked Oct 04 '21 at 23:17

Active Oct 04 '21 at 23:17

Viewed 162 times

Basically this is my problem: I'm trying to implement an LSH solution with Spark, but to do that I need both the training and test sets to have the same columns in the same order.

Second problem: the training and test sets contain different unique values.

For example, one column in the training set can have 3 unique values:

col
a
b
c

and when I apply OHE

But my test set can be, for example:

col
a
e
f

so applying OHE

But what I would like to see in the test is

How can I achieve this in Spark? I don't care about unique values that appear only in the test dataset, but I do care about values that are in the training set and not in the test set.
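To make the requirement concrete, here is a minimal plain-Python sketch of one reading of it: the category list is learned from the training column only, and every test value is encoded against that list, so values unseen in training become all-zero rows. The helper names `fit_categories` and `one_hot` are hypothetical, and this only illustrates the intended layout; it is not Spark code.

```python
def fit_categories(values):
    """Learn the category list from the training column only (sorted for a stable order)."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value against the training categories; unseen values give all zeros."""
    return [1 if value == category else 0 for category in categories]


train_col = ["a", "b", "c"]
test_col = ["a", "e", "f"]

categories = fit_categories(train_col)                    # ['a', 'b', 'c']
encoded_test = [one_hot(v, categories) for v in test_col]

print(categories)
print(encoded_test)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Under this reading the test set keeps the training columns `a`, `b`, `c` in the same order, and `e` and `f` contribute nothing.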

asked Oct 04 '21 at 23:17

ianux22

2

[`OneHotEncoder`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html) has a parameter *handleInvalid='keep'* to handle invalid categories at prediction time. – Michael Szczesny Oct 04 '21 at 23:59


0 Answers