
When I use OneHotEncoder in Spark, I get the result shown in the fourth column, which is a sparse vector.

// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
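
For reference, here is a minimal sketch of the kind of StringIndexer + OneHotEncoder pipeline that produces a table like the one above (column names match my example; note that in Spark 3.x OneHotEncoder is an estimator and needs a fit step, while older versions transformed directly):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame(
    [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'NA'), (4, 'a'), (5, 'c')],
    ['id', 'category'])

# StringIndexer maps each category to a numeric index (most frequent first)
indexed = StringIndexer(inputCol='category',
                        outputCol='categoryIndex').fit(df).transform(df)

# With the default dropLast=True, the last index is encoded as the
# all-zero vector, which is why 'NA' shows up as (3,[],[])
encoded = OneHotEncoder(inputCol='categoryIndex',
                        outputCol='categoryVec').fit(indexed).transform(indexed)
encoded.show()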

However, what I want is to produce 3 separate columns for the categories, just like the way get_dummies works in pandas.

>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
Mohamed Ibrahim
    Why would you want to do this? This will make your data very big and memory inefficient. – David Arenburg Mar 19 '17 at 09:07
  • It will not make the data that big because I don't have many distinct values in my dataset. The resulting features will be 122 (122 columns). I want to do that so it is easier to process them with TensorFlow. I want to feed the data as input to a neural network. – Mohamed Ibrahim Mar 20 '17 at 19:43

2 Answers


Spark's OneHotEncoder creates a sparse vector column. To get output columns similar to pandas' get_dummies, we need to create a separate column for each category. We can do that with the PySpark DataFrame's withColumn function, passing a udf as a parameter. For example:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType


df = sqlContext.createDataFrame(sc.parallelize(
        [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]), ('col1', 'col2'))

# Collect the distinct categories to the driver and sort them so the
# generated columns come out in a deterministic order
categories = df.select('col2').distinct().rdd.flatMap(lambda x: x).collect()
categories.sort()

# Add one 0/1 column per category. Binding the loop variable as a default
# argument (category=category) makes each udf capture its own category,
# even if the lambda is serialized after the loop has moved on
for category in categories:
    function = udf(lambda item, category=category: 1 if item == category else 0,
                   IntegerType())
    new_column_name = 'col2' + '_' + category
    df = df.withColumn(new_column_name, function(col('col2')))

df.show()

Output:

+----+----+------+------+------+------+                                         
|col1|col2|col2_a|col2_b|col2_c|col2_d|
+----+----+------+------+------+------+
|   0|   a|     1|     0|     0|     0|
|   1|   b|     0|     1|     0|     0|
|   2|   c|     0|     0|     1|     0|
|   3|   d|     0|     0|     0|     1|
+----+----+------+------+------+------+
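
As a side note on the design choice: the same dummy columns can be built without a Python udf, using the built-in when/otherwise column expressions, which keeps the comparison in the JVM and avoids udf serialization overhead. A minimal sketch, reusing the df and categories from above:

from pyspark.sql.functions import when, col

# Same loop as above, but with a native column expression instead of a udf
for category in categories:
    df = df.withColumn('col2_' + category,
                       when(col('col2') == category, 1).otherwise(0))
df.show()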

I hope this helps.

arker296

Can't comment because I don't have the reputation points, so answering the question instead.

This is actually one of the best things about Spark pipelines and transformers! I do not understand why you would need to get it in this format. Can you elaborate?

  • Thanks for the reply. Repeating my comment above: It will not make the data that big because I don't have many distinct values in my dataset. The resulting features will be 122 (122 columns). I want to do that so it is easier to process them with TensorFlow. I want to feed the data as input to a neural network. – Mohamed Ibrahim Mar 20 '17 at 19:43