This is my problem: I'm trying to implement an LSH solution with Spark, but to do that I need the training and test sets to have the same columns in the same order.
The second problem is that the training and test sets contain different unique values.
For example, one column in the training set can have three unique values:
col
a
b
c
and when I apply one-hot encoding (OHE) I get:
a b c
1 0 0
0 1 0
0 0 1
But the same column in my test set can contain, for example:
col
a
e
f
so applying OHE to the test set on its own gives:
a e f
1 0 0
0 1 0
0 0 1
But what I would like to see for the test set is:
a b c
1 0 0
0 0 0
0 0 0
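To pin down the behavior I am after, here is a minimal plain-Python sketch (not Spark code). The helper names `fit_categories` and `encode` are made up for illustration. The categories are learned from the training column only, and any test value that was not seen in training becomes an all-zero row.

```python
def fit_categories(values):
    """Collect the distinct categories in first-seen order (training data only)."""
    return list(dict.fromkeys(values))


def encode(values, categories):
    """One-hot encode against a fixed category list; unseen values give all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]


train = ["a", "b", "c"]
test = ["a", "e", "f"]

# The column layout is fixed by the training set and reused for the test set.
categories = fit_categories(train)
train_encoded = encode(train, categories)
test_encoded = encode(test, categories)

print(categories)        # ['a', 'b', 'c']
for row in test_encoded:
    print(row)           # [1, 0, 0], then [0, 0, 0], then [0, 0, 0]
```

Because both sets are encoded against the same category list, they end up with the same columns in the same order, which is what the LSH step needs.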
How can I achieve this in Spark? I don't care about values that appear only in the test dataset, but I do need every value seen in training to keep its own column, even when it is missing from the test set.