
Basically this is my problem: I'm trying to implement an LSH (locality-sensitive hashing) solution with Spark, and for that I need the training and test sets to have the same columns in the same order.

The second problem is that the training and test sets contain different unique values.

For example, one column in the training set can have 3 unique values:

col
a
b
c

and when I apply one-hot encoding (OHE) I get:

a  b  c
1  0  0
0  1  0
0  0  1

But my test set can be, for example:

col
a
e
f

so applying OHE gives:

a  e  f
1  0  0
0  1  0
0  0  1

But what I would like to see in the test set is:

a  b  c
1  0  0
0  0  0
0  0  0

How can I achieve this in Spark? I don't care about values that appear only in the test dataset; I care about the values that appear in the training set, including those that are missing from the test set.
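
To make the target behaviour concrete, here is a minimal plain-Python sketch (not Spark) of what I mean: the category list is learned from the training column only, and the test column is encoded against that fixed list, so unseen values become all-zero rows. The helper names `fit_categories` and `one_hot` are hypothetical, made up for illustration.

```python
def fit_categories(train_values):
    """Learn the column layout from the training data only (sorted for a stable order)."""
    return sorted(set(train_values))


def one_hot(values, categories):
    """Encode values against a fixed category list; unseen values give all-zero rows."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


categories = fit_categories(["a", "b", "c"])          # layout comes from train
train_encoded = one_hot(["a", "b", "c"], categories)
test_encoded = one_hot(["a", "e", "f"], categories)   # "e" and "f" were never seen
```

Here `test_encoded` has the same three columns, in the same order, as `train_encoded`.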

ianux22
[`OneHotEncoder`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html) has a parameter *handleInvalid='keep'* to handle invalid categories for predictions. – Michael Szczesny Oct 04 '21 at 23:59

0 Answers