Drop level from one-hot-encoded column in Spark

Question

If I already have a column created by OneHotEncoderEstimator, how can I drop one of the levels on the fly?

Say you have a column with 4 levels (one already dropped to avoid linear dependence) and you want to drop one more level (i.e., put it in the intercept).

So to go from something like

row, fruit , encoded
1  , apple , [1, 0, 0]
2  , orange, [0, 1, 0]
3  , pear  , [0, 0, 1]

to

row, fruit , encoded
1  , apple , [1, 0]
2  , orange, [0, 1]
3  , pear  , [0, 0]

One of the challenges is that the OneHotEncoderEstimator returns a SparseVector for every row. I'm not even sure how to drop the 'right' index of the vector since all I have is the column name and the level.
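To make the index problem concrete, here is a minimal sketch of what "dropping a level" means for a sparse one-hot vector: remove the entry at the dropped position, if present, and shift every later position left by one. The names `drop_sparse_index` and `make_drop_udf` are invented for illustration. The sketch assumes the encoded column holds SparseVector values, and that the position of a level in the vector equals the index a StringIndexer assigned to it upstream.

```python
def drop_sparse_index(size, indices, values, drop):
    """Remove position `drop` from a sparse vector given as (size, indices, values)."""
    if not 0 <= drop < size:
        raise ValueError("index %d is outside a vector of size %d" % (drop, size))
    new_indices, new_values = [], []
    for i, v in zip(indices, values):
        if i == drop:
            continue  # entry of the dropped level; a one-hot row becomes all zeros
        new_indices.append(i - 1 if i > drop else i)  # shift later positions left
        new_values.append(v)
    return size - 1, new_indices, new_values


def make_drop_udf(drop):
    """Wrap the helper in a Spark UDF (hypothetical wrapper, assumes SparseVector input)."""
    # Imported lazily so the pure-Python helper above works without Spark installed.
    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql.functions import udf

    def _drop(vec):
        if vec is None:
            return None
        size, idx, vals = drop_sparse_index(
            vec.size,
            [int(i) for i in vec.indices],
            [float(v) for v in vec.values],
            drop,
        )
        return SparseVector(size, idx, vals)

    return udf(_drop, VectorUDT())
```

If the fitted StringIndexerModel is still around (called `indexer_model` here, a hypothetical name), the position could be looked up with `indexer_model.labels.index("pear")` and applied with `df.withColumn("encoded", make_drop_udf(idx)("encoded"))`. With dropLast enabled the last label has no position in the vector, so the helper raises a ValueError for it. Spark's built-in VectorSlicer, configured with every index except the dropped one, may achieve the same result without a Python UDF.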

I know I could just remove the rows and re-encode, but I'm trying to avoid that.

Does anyone know how to do this in Python/Spark 2.3?

EDIT

So I wanted to clarify that the 'encoded' column is a sparse matrix (or alternatively, a column of SparseVector objects).

See: https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector

Some of the answers here discuss the differences between sparse and dense vectors in Spark:

Sparse Vector vs Dense Vector

Yea so in this case let's say I know I want to drop the 'pear' level from the 'fruit' column, but only in the context of the encoded column. Meaning I still want the Dataframe to hold all the rows, I just want to drop this level/index from the one-hot-encoded column. — moefasa, Dec 16 '18 at 17:41
Onehotencoder has a parameter dropLast, use the setter method to turn this on/off as you need — sramalingam24, Dec 16 '18 at 18:24
@sramalingam24 thanks for the comment but in this case I want to see what the cost of dropping levels 'on the fly' would be (after the column has been encoded). This lets me encode once, and drop levels as I need (it just works better in my pipeline this way). — moefasa, Jan 09 '19 at 03:51
