If I already have a column created by OneHotEncoderEstimator, how can I drop one of the levels on the fly?
Say you have a column with 4 levels (one already dropped to avoid linear dependence) and you want to drop the 2nd level as well (i.e., fold it into the intercept).
So to go from something like
row, fruit , encoded
1  , apple , [1, 0, 0]
2  , orange, [0, 1, 0]
3  , pear  , [0, 0, 1]
to
row, fruit , encoded
1  , apple , [1, 0]
2  , orange, [0, 1]
3  , pear  , [0, 0]
One of the challenges is that the OneHotEncoderEstimator returns a SparseVector for every row. I'm not even sure how to drop the 'right' index of the vector since all I have is the column name and the level.
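To make the operation concrete, here is a rough sketch in plain Python of what dropping one position from a sparse vector would mean. It uses a (size, indices, values) triple as a stand-in for a SparseVector, and drop_position is a hypothetical helper, not a Spark API. It also assumes the position of the level is already known, which is exactly the part I don't have.

```python
from typing import List, Tuple

# Stand-in for a sparse vector: (size, active indices, values at those indices).
Sparse = Tuple[int, List[int], List[float]]


def drop_position(vec: Sparse, drop: int) -> Sparse:
    """Return a copy of `vec` with position `drop` removed (hypothetical helper)."""
    size, indices, values = vec
    if not 0 <= drop < size:
        raise IndexError("position %d is outside a vector of size %d" % (drop, size))

    new_indices: List[int] = []
    new_values: List[float] = []
    for i, v in zip(indices, values):
        if i == drop:
            # This entry belongs to the dropped level, so it disappears.
            continue
        # Entries after the dropped position shift left by one.
        new_indices.append(i - 1 if i > drop else i)
        new_values.append(v)
    return (size - 1, new_indices, new_values)


# The three example rows, written sparsely.
rows = {
    "apple": (3, [0], [1.0]),
    "orange": (3, [1], [1.0]),
    "pear": (3, [2], [1.0]),
}

# Drop the last position (0-based index 2), as in the target table above.
reduced = {fruit: drop_position(vec, 2) for fruit, vec in rows.items()}
```

With this, the pear row ends up with no active entries, i.e. [0, 0] when written densely. What I'm after is the equivalent of this applied to the whole column in Spark.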
I know I could just remove the rows and re-encode, but I'm trying to avoid that.
Does anyone know how to do this in Python/Spark 2.3?
EDIT
To clarify, the 'encoded' column is a sparse matrix (or alternatively, a column of SparseVector objects).
See: https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector
Some of the answers here discuss the differences between sparse and dense vectors in Spark: