I'm working in PySpark (Spark 2.1) to prepare my data for a logistic regression. I have several string variables in my data, and I want to set the most frequent category of each as the reference level. I first use StringIndexer to encode a string column into label indices; these are ordered by label frequency, with the most frequent label receiving index 0.
from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol="income_grp", outputCol="income_grp_indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)
+-------------+------------------+
| income_grp|income_grp_indexed|
+-------------+------------------+
|200000_299999| 0.0|
|300000_499999| 1.0|
|100000_199999| 2.0|
|500000_749999| 3.0|
| less_100000| 4.0|
|750000_999999| 5.0|
| ge_1000000| 6.0|
+-------------+------------------+
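To double-check that the indices really follow frequency, I ran a quick count per level (just a sanity check against my own data):

# Sanity check: counts should be non-increasing as the index grows
indexed.groupBy("income_grp", "income_grp_indexed") \
    .count() \
    .orderBy("income_grp_indexed") \
    .show()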
Then I use OneHotEncoder to map the column of label indices to a column of binary vectors. However, the only option OneHotEncoder exposes is dropLast, which removes the last index, and that is the least frequent category.
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(dropLast=True, inputCol="income_grp_indexed", outputCol="income_grp_encoded")
encoded = encoder.transform(indexed)
+-------------+------------------+------------------+
| income_grp|income_grp_indexed|income_grp_encoded|
+-------------+------------------+------------------+
|200000_299999| 0.0| (6,[0],[1.0])|
|300000_499999| 1.0| (6,[1],[1.0])|
|100000_199999| 2.0| (6,[2],[1.0])|
|500000_749999| 3.0| (6,[3],[1.0])|
| less_100000| 4.0| (6,[4],[1.0])|
|750000_999999| 5.0| (6,[5],[1.0])|
| ge_1000000| 6.0| (6,[],[])|
+-------------+------------------+------------------+
How can I remove the most frequent category of each of my string variables?
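One workaround I've sketched is to encode with dropLast=False and then cut index 0 (the most frequent level) out of the vector with VectorSlicer. The names encoder_full, n_levels, and income_grp_encoded2 below are just placeholders for my example, and I'm not sure this is the idiomatic way:

from pyspark.ml.feature import OneHotEncoder, VectorSlicer

# Keep all 7 levels in the one-hot vector, then slice away index 0
# (the most frequent level) so it becomes the all-zeros reference
encoder_full = OneHotEncoder(dropLast=False, inputCol="income_grp_indexed",
                             outputCol="income_grp_full")
full = encoder_full.transform(indexed)

n_levels = 7  # number of distinct income_grp values in my data
slicer = VectorSlicer(inputCol="income_grp_full", outputCol="income_grp_encoded2",
                      indices=list(range(1, n_levels)))
sliced = slicer.transform(full)

But repeating this for every string column feels clunky, so I'd prefer a built-in option if one exists.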