
I have multi-index mapping rules; here are the rules:

Type A: Chicken, Beef, Goat
Type B: Fish, Shrimp
Type C: Chicken, Pork

Here's my dataframe (let's say it is called `df`), and I want to do the multi-index mapping:

   id   Menu
   1    Fried Chicken
   2    Shrimp Chips
   3    Pork with Cheese
   4    Fish Spaghetti
   5    Goat Sate
   6    Beef Soup

With pandas, this can be done with pd.MultiIndex and pd.Series.str.get_dummies, so the code would be:

import pandas as pd

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

df.join(
    df.Menu.str.get_dummies(sep=' ')
      .reindex(columns=mux, level=1)
      .max(axis=1, level=0)
)

The output would look like this:

   id   Menu                Type A   Type B   Type C
   1    Fried Chicken       1        0        1
   2    Shrimp Chips        0        1        0
   3    Pork with Cheese    0        0        1
   4    Fish Spaghetti      0        1        0
   5    Goat Sate           1        0        0
   6    Beef Soup           1        0        0

How can I do this with a PySpark DataFrame?

Nabih Bawazir

1 Answer


There is no standard multi-indexing in Spark.

What you can do is check whether your string contains any of your elements:

from pyspark.sql import functions as F

index_dict = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# For each type, OR together one LIKE test per keyword,
# then store the result as a new boolean column.
for col_name, values in index_dict.items():
    col = F.col('Menu').like('%' + values[0] + '%')
    for value in values[1:]:
        col = col | F.col('Menu').like('%' + value + '%')
    df = df.withColumn(col_name, col)

I think it could be faster if you don't check the whole string, since you know the separator. You could therefore use:

df = df.withColumn('tmp', F.split(df['Menu'], ' '))

Replace F.col('Menu').like('%'...'%') with F.array_contains(F.col('tmp'), ...)

And finish with df = df.drop('tmp')
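Putting those three pieces together, a minimal sketch of that variant (untested, assuming the same df and index_dict as above) could be:

from pyspark.sql import functions as F

# Split Menu once on the separator, then test keyword membership
# in the resulting array instead of scanning the whole string.
df = df.withColumn('tmp', F.split(df['Menu'], ' '))

for col_name, values in index_dict.items():
    col = F.array_contains(F.col('tmp'), values[0])
    for value in values[1:]:
        col = col | F.array_contains(F.col('tmp'), value)
    df = df.withColumn(col_name, col)

df = df.drop('tmp')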

Pierre Gourseaud
  • Error on the last line `AttributeError: 'PipelinedRDD' object has no attribute 'withColumn'` – Nabih Bawazir Aug 13 '18 at 14:48
  • You're not using a `DataFrame` object but a `PipelinedRDD`. Can you convert it to a Spark DataFrame? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame – Pierre Gourseaud Aug 13 '18 at 14:50
  • It works, but your output is boolean rather than numeric: header `id Menu Type A Type B Type C` and first row `1 Fried Chicken true false true`. Still, you deserve the best answer – Nabih Bawazir Aug 13 '18 at 15:25
  • Yes, I thought that was useful. Then do `from pyspark.sql.types import IntegerType` and change `df = df.withColumn(col_name, col)` to `df = df.withColumn(col_name, col.cast(IntegerType))` – Pierre Gourseaud Aug 13 '18 at 15:28
  • Already tried the last comment; the error is `unexpected type: `. I guess it's better to ask a new question to show the whole error, but it is 10.30 pm in Indonesia now, so I will ask tomorrow – Nabih Bawazir Aug 13 '18 at 15:34
  • Yes indeed, it's my fault: it must be `df = df.withColumn(col_name, col.cast(IntegerType()))` (an instance of the class); see the consolidated sketch after this thread – Pierre Gourseaud Aug 13 '18 at 15:36
  • The last output only casts 1 column – Nabih Bawazir Aug 14 '18 at 09:21
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177999/discussion-between-arius-and-nabih-bawazir). – Pierre Gourseaud Aug 14 '18 at 10:07
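Pulling the answer together with the cast discussed in the comments, a consolidated sketch (untested, assuming the same df and index_dict) would be:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

index_dict = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# Build one boolean expression per type and cast it to a 0/1 flag
# so the new columns are numeric rather than boolean.
for col_name, values in index_dict.items():
    col = F.col('Menu').like('%' + values[0] + '%')
    for value in values[1:]:
        col = col | F.col('Menu').like('%' + value + '%')
    df = df.withColumn(col_name, col.cast(IntegerType()))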