Since you're using Spark 1.6, you can't use pyspark.sql.functions.from_json() (it wasn't introduced until Spark 2.1) - you're going to have to use a udf.
This question is very similar to PySpark “explode” dict in column, but I accept it's not a dupe for 2 reasons:

1. Your string column is not valid JSON (because of single quotes)
2. You want the keys to become columns
Nevertheless, the first step is to follow the same steps as in the linked post, with a minor tweak to the parse() function that replaces single quotes with double quotes:
from pyspark.sql.functions import udf, explode, first
from pyspark.sql.types import *
import json
def parse(s):
    # Single quotes aren't valid JSON, so swap them for double quotes first
    try:
        return json.loads(s.replace("'", '"'))
    except ValueError:  # json.JSONDecodeError is Python 3.5+ only; ValueError covers both
        pass

parse_udf = udf(parse, MapType(StringType(), StringType()))
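For reference, here's a minimal sketch of sample data matching the output shown below (the sqlContext name follows the usual Spark 1.6 shell convention; your real df presumably already exists):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext named sc
df = sqlContext.createDataFrame(
    [("4A", "{'AP': '201', 'BQ': '2'}"),
     ("8B", "{'AP': '501', 'BQ': '1', 'IN': '5'}")],
    ["ID", "ESTRUC_COMP"]
)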
Now you can parse the string and call pyspark.sql.functions.explode():
df.select("ID", explode(parse_udf("ESTRUC_COMP"))).show()
#+---+---+-----+
#| ID|key|value|
#+---+---+-----+
#| 4A| BQ| 2|
#| 4A| AP| 201|
#| 8B| IN| 5|
#| 8B| BQ| 1|
#| 8B| AP| 501|
#+---+---+-----+
Finally, call pivot() to get the keys as columns. You can use first() as the aggregate function because we know the key-value relationship is one-to-one for each ID.
df.select("*", explode(parse_udf("ESTRUC_COMP")))\
.groupBy("ID","ESTRUC_COMP").pivot("key").agg(first("value")).show(truncate=False)
#+---+-----------------------------------+---+---+----+
#|ID |ESTRUC_COMP |AP |BQ |IN |
#+---+-----------------------------------+---+---+----+
#|4A |{'AP': '201', 'BQ': '2'} |201|2 |null|
#|8B |{'AP': '501', 'BQ': '1', 'IN': '5'}|501|1 |5 |
#+---+-----------------------------------+---+---+----+
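As an aside, if you already know the distinct keys, you can pass them to pivot() explicitly, which saves Spark a pass over the data to compute them (the column list below is just taken from the sample output):

df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID", "ESTRUC_COMP")\
    .pivot("key", ["AP", "BQ", "IN"])\
    .agg(first("value"))\
    .show(truncate=False)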
Of course, since I defined the udf to return MapType(StringType(), StringType()), all of your resultant columns are going to be strings. You can either cast them or modify the udf accordingly.
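For example, here's a sketch of casting the pivoted columns after the fact (the int type and the AP/BQ/IN column names are assumptions based on the sample data; adjust to match yours):

from pyspark.sql.functions import col

pivoted = df.select("*", explode(parse_udf("ESTRUC_COMP")))\
    .groupBy("ID", "ESTRUC_COMP")\
    .pivot("key")\
    .agg(first("value"))

# Cast each key column from string to int, keeping the other columns as-is
pivoted.select(
    "ID",
    "ESTRUC_COMP",
    *[col(c).cast("int").alias(c) for c in ["AP", "BQ", "IN"]]
).show(truncate=False)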