VectorIndexer or OneHotEncoder for categorical variables?

Question

I'm slightly confused about when to use VectorIndexer versus OneHotEncoder for categorical variables that feed ML algorithms in Spark. Is it that I need OneHotEncoder when I want to know the effect of each categorical level on the ML output, and VectorIndexer can be used in all other cases?

An example is shown below:

from pyspark.ml.feature import OneHotEncoder, VectorAssembler, VectorIndexer

df = sqlContext.createDataFrame([
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4)
], ["category1", "category2","readings"])

encoder = OneHotEncoder(dropLast=True, inputCols=["category1", "category2"],
                        outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()


+---------+---------+--------+-------------+-------------+
|category1|category2|readings| categoryVec1| categoryVec2|
+---------+---------+--------+-------------+-------------+
|      0.0|      3.0|     3.8|(2,[0],[1.0])|    (3,[],[])|
|      1.0|      0.0|     6.7|(2,[1],[1.0])|(3,[0],[1.0])|
|      2.0|      3.0|     3.3|    (2,[],[])|    (3,[],[])|
|      0.0|      2.0|     1.2|(2,[0],[1.0])|(3,[2],[1.0])|
|      0.0|      1.0|     7.8|(2,[0],[1.0])|(3,[1],[1.0])|
|      2.0|      0.0|     4.4|    (2,[],[])|(3,[0],[1.0])|
+---------+---------+--------+-------------+-------------+


va = VectorAssembler(inputCols=df.columns, outputCol='features')
assembled = va.transform(df)
idx = VectorIndexer(inputCol='features', outputCol='features_indexed', maxCategories=4)
idx_model = idx.fit(assembled)
transformed = idx_model.transform(assembled)
transformed.show()

+---------+---------+--------+-------------+----------------+
|category1|category2|readings|     features|features_indexed|
+---------+---------+--------+-------------+----------------+
|      0.0|      3.0|     3.8|[0.0,3.0,3.8]|   [0.0,3.0,3.8]|
|      1.0|      0.0|     6.7|[1.0,0.0,6.7]|   [1.0,0.0,6.7]|
|      2.0|      3.0|     3.3|[2.0,3.0,3.3]|   [2.0,3.0,3.3]|
|      0.0|      2.0|     1.2|[0.0,2.0,1.2]|   [0.0,2.0,1.2]|
|      0.0|      1.0|     7.8|[0.0,1.0,7.8]|   [0.0,1.0,7.8]|
|      2.0|      0.0|     4.4|[2.0,0.0,4.4]|   [2.0,0.0,4.4]|
+---------+---------+--------+-------------+----------------+

idx_model.categoryMaps

{0: {0.0: 0, 1.0: 1, 2.0: 2}, 1: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3}}

score 2 · Answer 1 · answered Jul 06 '21 at 18:09

To my understanding, OneHotEncoder applies only to numeric columns. If your categorical variable is StringType, you need to pass it through StringIndexer before you can apply OneHotEncoder.
StringIndexer maps each label to a numeric index, and OneHotEncoder then turns each index into a binary vector.
The way Spark prints OneHotEncoder output is unintuitive: by default (dropLast=True) the last category is dropped and encoded as an all-zero vector. The docs say in the Notes section:

This is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
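
To make the sparse notation concrete, here is a small pure-Python sketch of how a category index ends up printed as (size,[indices],[values]) when dropLast is true. It is not Spark's implementation; the function names one_hot_sparse and to_dense are made up for illustration.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values), mimicking the printed sparse vector."""
    idx = int(index)
    if not 0 <= idx < num_categories:
        raise ValueError(f"index {index} outside 0..{num_categories - 1}")
    size = num_categories - 1 if drop_last else num_categories
    if idx < size:
        return (size, [idx], [1.0])
    # The dropped last category has no slot, so the vector is empty (all zeros).
    return (size, [], [])


def to_dense(sparse):
    """Expand (size, indices, values) into a plain list of floats."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# category1 in the question has three levels: 0.0, 1.0 and 2.0.
for level in (0.0, 1.0, 2.0):
    sparse = one_hot_sparse(level, num_categories=3)
    print(level, sparse, to_dense(sparse))
```

Level 2.0 comes out as (2, [], []), which is the same empty vector shown for category1 = 2.0 in the question's output.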

If your categorical features are already packed into a Vector column, use VectorIndexer on it: it indexes every feature that has at most maxCategories distinct values and leaves the rest as continuous. Specifically, you can use VectorIndexer on your "features" column. Here's a similar question.
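
As a rough illustration of how VectorIndexer decides which features are categorical, the following pure-Python sketch builds category maps from the rows in the question. It is a simplification under stated assumptions, not Spark's code: fit_category_maps is a hypothetical name, and the zero-first ordering is meant to mirror the documented guarantee that a value of 0 maps to index 0.

```python
def fit_category_maps(rows, max_categories):
    """Map feature position -> {original value: category index}.

    Features with more than max_categories distinct values are skipped,
    which means they are treated as continuous.
    """
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = {row[j] for row in rows}
        if len(distinct) <= max_categories:
            # Assumption: 0.0 first (if present), then ascending order.
            ordered = sorted(distinct, key=lambda v: (v != 0.0, v))
            category_maps[j] = {value: i for i, value in enumerate(ordered)}
    return category_maps


# Same rows as the question: (category1, category2, readings).
rows = [
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4),
]

print(fit_category_maps(rows, max_categories=4))
```

With max_categories=4 the first two features are indexed and readings (six distinct values) is left out, which agrees with the categoryMaps printed in the question.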

You need to fill in the nulls first in your categorical columns.
In PySpark, that's df.na.fill("value", subset=["col1","col2",...]).
In Scala, that's df.na.fill("value", Seq("col1","col2",...))

Here's a full example:

dummydata= [
  (1,"John","B.A.",20,"Male"),
  (2,"Martha","B.Com.",None,"Female"),
  (3,"Mona","B.Com.",21,"Female"),
  (4,"Harish","B.Sc.",22,"Male"),
  (5,"Sam",None,35,"Male"),
  (6,"Jonny","B.A.",22,"Male"),
  (7,"Maria","B.A.",None,"Female"),
  (8,None,"B.A.",25,"Male"),
  (9,"Monalisa","B.A.",21,"Female")
]

toydf = spark.createDataFrame(data=dummydata, schema=["id", "name", "qualification", "age", "gender"])

toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|         null|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|    null|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+

# Replace nulls in the string columns with the placeholder label "NA".
toydf = toydf.na.fill("NA", subset=["name", "qualification"])

toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|           NA|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|      NA|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+

from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer, VectorIndexer

# Map each qualification string to a numeric index, most frequent label first.
indexer_1 = StringIndexer(inputCols=["qualification"], outputCols=["qual_index"], handleInvalid='keep', stringOrderType='frequencyDesc')

# One-hot encode the index into a sparse vector column.
ohe_1 = OneHotEncoder(inputCols=["qual_index"], outputCols=["qual_coded"], handleInvalid='keep', dropLast=True)

toydf = indexer_1.fit(toydf).transform(toydf)
toydf = ohe_1.fit(toydf).transform(toydf)

toydf.show()
+---+--------+-------------+----+------+----------+-------------+
| id|    name|qualification| age|gender|qual_index|   qual_coded|
+---+--------+-------------+----+------+----------+-------------+
|  1|    John|         B.A.|  20|  Male|       0.0|(5,[0],[1.0])|
|  2|  Martha|       B.Com.|null|Female|       1.0|(5,[1],[1.0])|
|  3|    Mona|       B.Com.|  21|Female|       1.0|(5,[1],[1.0])|
|  4|  Harish|        B.Sc.|  22|  Male|       2.0|(5,[2],[1.0])|
|  5|     Sam|           NA|  35|  Male|       3.0|(5,[3],[1.0])|
|  6|   Jonny|         B.A.|  22|  Male|       0.0|(5,[0],[1.0])|
|  7|   Maria|         B.A.|null|Female|       0.0|(5,[0],[1.0])|
|  8|      NA|         B.A.|  25|  Male|       0.0|(5,[0],[1.0])|
|  9|Monalisa|         B.A.|  21|Female|       0.0|(5,[0],[1.0])|
+---+--------+-------------+----+------+----------+-------------+
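
The qual_index values above follow from the frequencyDesc ordering. A minimal pure-Python sketch of that ordering, assuming ties are broken alphabetically (which is consistent with the output shown); frequency_desc_index is an illustrative name, not a Spark function:

```python
from collections import Counter


def frequency_desc_index(labels):
    """Assign index 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The qualification column after nulls were filled with "NA".
qualifications = ["B.A.", "B.Com.", "B.Com.", "B.Sc.", "NA",
                  "B.A.", "B.A.", "B.A.", "B.A."]

print(frequency_desc_index(qualifications))
```

B.A. appears five times and gets 0.0, B.Com. appears twice and gets 1.0, and the two single-occurrence labels B.Sc. and NA get 2.0 and 3.0.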
