Since you're using Spark 1.6, you can't use pyspark.sql.functions.from_json() (it wasn't introduced until Spark 2.1) - you're going to have to use a udf.
This question is very similar to PySpark “explode” dict in column, but I accept it's not a dupe for 2 reasons:

1. Your string column is not valid JSON (because of single quotes)
2. You want the keys to become columns
Nevertheless, the first step is to follow the same steps as in the linked post, with a minor tweak to the parse() function that replaces single quotes with double quotes:
from pyspark.sql.functions import udf, explode, first
from pyspark.sql.types import *
import json
def parse(s):
    # Single quotes aren't valid JSON, so swap them for double quotes first
    try:
        return json.loads(s.replace("'", '"'))
    except ValueError:  # json.JSONDecodeError is Python 3.5+ only; ValueError covers both
        pass

parse_udf = udf(parse, MapType(StringType(), StringType()))
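For reference, here's a minimal sketch of sample data matching the output shown below (the sqlContext name follows the usual Spark 1.6 shell convention; your real df presumably already exists):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext named sc
df = sqlContext.createDataFrame(
    [("4A", "{'AP': '201', 'BQ': '2'}"),
     ("8B", "{'AP': '501', 'BQ': '1', 'IN': '5'}")],
    ["ID", "ESTRUC_COMP"]
)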
Now you can parse the string and call pyspark.sql.functions.explode():
df.select("ID", explode(parse_udf("ESTRUC_COMP"))).show()
#+---+---+-----+
#| ID|key|value|
#+---+---+-----+
#| 4A| BQ| 2|
#| 4A| AP| 201|
#| 8B| IN| 5|
#| 8B| BQ| 1|
#| 8B| AP| 501|
#+---+---+-----+
Finally, call pivot() to get the keys as columns. You can use first() as the aggregate function because we know the key-value relationship is one-to-one for each ID.
df.select("*", explode(parse_udf("ESTRUC_COMP")))\
.groupBy("ID","ESTRUC_COMP").pivot("key").agg(first("value")).show(truncate=False)
#+---+-----------------------------------+---+---+----+
#|ID |ESTRUC_COMP |AP |BQ |IN |
#+---+-----------------------------------+---+---+----+
#|4A |{'AP': '201', 'BQ': '2'} |201|2 |null|
#|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1 |5 |
#+---+-----------------------------------+---+---+----+
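As an aside, if you already know the distinct keys, you can pass them to pivot() explicitly, which saves Spark a pass over the data to compute them (the column list below is just taken from the sample output):

df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID", "ESTRUC_COMP")\
    .pivot("key", ["AP", "BQ", "IN"])\
    .agg(first("value"))\
    .show(truncate=False)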
Of course, since I defined the udf to return MapType(StringType(), StringType()), all of your resultant columns are going to be strings. You can either cast them or modify the udf accordingly.
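For example, here's a sketch of casting the pivoted columns after the fact (the int type and the AP/BQ/IN column names are assumptions based on the sample data; adjust to match yours):

from pyspark.sql.functions import col

pivoted = df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID", "ESTRUC_COMP")\
    .pivot("key")\
    .agg(first("value"))

# Cast each key column from string to int, keeping the other columns as-is
pivoted.select(
    "ID",
    "ESTRUC_COMP",
    *[col(c).cast("int").alias(c) for c in ["AP", "BQ", "IN"]]
).show(truncate=False)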