I'm working in PySpark (Spark 2.1) to prepare my data for a logistic regression. I have several string variables in my data, and I want to set the most frequent category of each as the reference level. I first use StringIndexer to encode a string column into label indices; these are ordered by label frequency, with the most frequent label receiving index 0.
from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol="income_grp", outputCol="income_grp_indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)
+-------------+------------------+
| income_grp|income_grp_indexed|
+-------------+------------------+
|200000_299999| 0.0|
|300000_499999| 1.0|
|100000_199999| 2.0|
|500000_749999| 3.0|
| less_100000| 4.0|
|750000_999999| 5.0|
| ge_1000000| 6.0|
+-------------+------------------+
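To double-check that the indices really follow frequency, I ran a quick count per level (just a sanity check against my own data):

# Sanity check: counts should be non-increasing as the index grows
indexed.groupBy("income_grp", "income_grp_indexed") \
    .count() \
    .orderBy("income_grp_indexed") \
    .show()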
Then I use OneHotEncoder to map the column of label indices to a column of binary vectors. However, the only option OneHotEncoder exposes is dropLast, which removes the last index, and that is the least frequent category.
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(dropLast=True, inputCol="income_grp_indexed", outputCol="income_grp_encoded")
encoded = encoder.transform(indexed)
+-------------+------------------+------------------+
| income_grp|income_grp_indexed|income_grp_encoded|
+-------------+------------------+------------------+
|200000_299999| 0.0| (6,[0],[1.0])|
|300000_499999| 1.0| (6,[1],[1.0])|
|100000_199999| 2.0| (6,[2],[1.0])|
|500000_749999| 3.0| (6,[3],[1.0])|
| less_100000| 4.0| (6,[4],[1.0])|
|750000_999999| 5.0| (6,[5],[1.0])|
| ge_1000000| 6.0| (6,[],[])|
+-------------+------------------+------------------+
How can I remove the most frequent category of each of my string variables?
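One workaround I've sketched is to encode with dropLast=False and then cut index 0 (the most frequent level) out of the vector with VectorSlicer. The names encoder_full, n_levels, and income_grp_encoded2 below are just placeholders for my example, and I'm not sure this is the idiomatic way:

from pyspark.ml.feature import OneHotEncoder, VectorSlicer

# Keep all 7 levels in the one-hot vector, then slice away index 0
# (the most frequent level) so it becomes the all-zeros reference
encoder_full = OneHotEncoder(dropLast=False, inputCol="income_grp_indexed",
                             outputCol="income_grp_full")
full = encoder_full.transform(indexed)

n_levels = 7  # number of distinct income_grp values in my data
slicer = VectorSlicer(inputCol="income_grp_full", outputCol="income_grp_encoded2",
                      indices=list(range(1, n_levels)))
sliced = slicer.transform(full)

But repeating this for every string column feels clunky, so I'd prefer a built-in option if one exists.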