
I have a nested JSON body like the one below:

{ "MainTag": { "GroupId": "10C81", "IDArray": [ "ABC-XYZ-123" ], "IDStruct": { "DSA-ASA-211": null, "BSA-ASA-211": null, "ABC-XYZ-123": [ { "BagId": "42425fsdfs", "TravelerId": "1234567", "LegCorr": [ { "DelID": "SQH", "SegID": "PQR-UVW" }, { "DelID": "GFS", "SegID": "GHS-UVW" } ] } ] } } }

My desired output is a flattened table with one row per leg, with the columns GroupId, ID, BagId, TravelerId, DelID and SegID.

Note: The IDStruct tag will only have values for the keys present in the array tag IDArray.

Can someone please help me achieve this using PySpark?

Below is the printSchema() output:

root
 |-- MainTag: struct (nullable = true)
 |    |-- GroupId: string (nullable = true)
 |    |-- IDArray: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- IDStruct: struct (nullable = true)
 |    |    |-- ABC-XYZ-123: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- BagId: string (nullable = true)
 |    |    |    |    |-- LegCorr: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- DelID: string (nullable = true)
 |    |    |    |    |    |    |-- SegID: string (nullable = true)
 |    |    |    |    |-- TravelerId: string (nullable = true)
 |    |    |-- DSA-ASA-211: string (nullable = true)
 |    |    |-- BSA-ASA-211: string (nullable = true)

– Vaibhav

2 Answers


Basically you can do it using the function explode. First, I will create the schema for the data, then create a DataFrame and explode all the nested arrays in between.

data = { "MainTag": { "GroupId": "10C81", "IDArray": [ "ABC-XYZ-123"],` "IDStruct": { "DSA-ASA-211": "null", "BSA-ASA-211": "null", "ABC-XYZ-123": [ { "BagId": "42425fsdfs", "TravelerId": "1234567", "LegCorr": [ { "DelID": "SQH", "SegID": "PQR-UVW" }, { "DelID": "GFS", "SegID": "GHS-UVW" } ] } ] } } }
    
    from pyspark.sql.types import *
    schema_main = StructType([
        StructField('MainTag', StructType([
            StructField('GroupId', StringType()),
            StructField('IDArray', ArrayType(StringType())),
            StructField('IDStruct', StructType([
                StructField('DSA-ASA-211', StringType()),
                StructField('BSA-ASA-211', StringType()),
                StructField('ABC-XYZ-123', ArrayType(
                    StructType([
                        StructField('BagId', StringType()),
                        StructField('TravelerId', StringType()),
                        StructField('LegCorr', ArrayType(
                            StructType([
                                StructField('DelID', StringType()),
                                StructField('SegID', StringType())
                            ])
                        ))
                    ])
                ))
            ]))
        ]))
    ])
    
    df = spark.createDataFrame(data=[data], schema=schema_main)
    
from pyspark.sql.functions import explode

# explode one array level at a time, carrying along the columns extracted so far
df.select("MainTag", "MainTag.GroupId", explode("MainTag.IDArray").alias("ID")) \
  .select("GroupId", "ID", explode("MainTag.IDStruct.ABC-XYZ-123").alias("ABC")) \
  .select("GroupId", "ID", "ABC.BagId", "ABC.TravelerId", explode("ABC.LegCorr").alias("Legs")) \
  .select("GroupId", "ID", "BagId", "TravelerId", "Legs.DelID", "Legs.SegID") \
  .show(truncate=False)

Here is the result:

+-------+-----------+----------+----------+-----+-------+
|GroupId|ID         |BagId     |TravelerId|DelID|SegID  |
+-------+-----------+----------+----------+-----+-------+
|10C81  |ABC-XYZ-123|42425fsdfs|1234567   |SQH  |PQR-UVW|
|10C81  |ABC-XYZ-123|42425fsdfs|1234567   |GFS  |GHS-UVW|
+-------+-----------+----------+----------+-----+-------+
– hopeman
  • Thanks. But the real issue is that the keys in the tag 'IDStruct' are not static and will vary for each record, which is why I cannot hardcode them while defining the schema. Basically, I cannot specify the 3 ids "DSA-ASA-211", "BSA-ASA-211", "ABC-XYZ-123" as column names. – Vaibhav Oct 04 '22 at 09:30
  • @Vaibhav you may try making the IDArray of type ArrayType(StringType()) and then translating the JSON later using the from_json function. Something like df.withColumn("IDArray", from_json(...)), but it may still only work if you have the same schema in one dataset. If each row is different, then it may not work. – hopeman Oct 04 '22 at 09:37
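
For what it's worth, here is a minimal, untested sketch of that idea: ingest IDStruct as a raw JSON string so its keys stay data rather than schema, pull out the entry for each exploded ID with get_json_object (using a per-row path built with concat, which goes beyond the comment's suggestion), and parse it with from_json. All column names and the sample input below are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# IDStruct is kept as a plain JSON string, so its keys can differ per record
raw = ('{"DSA-ASA-211":null,"BSA-ASA-211":null,'
       '"ABC-XYZ-123":[{"BagId":"42425fsdfs","TravelerId":"1234567",'
       '"LegCorr":[{"DelID":"SQH","SegID":"PQR-UVW"},'
       '{"DelID":"GFS","SegID":"GHS-UVW"}]}]}')
df = spark.createDataFrame(
    [("10C81", ["ABC-XYZ-123"], raw)],
    "GroupId string, IDArray array<string>, IDStruct string")

# schema of a single IDStruct entry -- assumed stable across records
entry_schema = ArrayType(StructType([
    StructField("BagId", StringType()),
    StructField("TravelerId", StringType()),
    StructField("LegCorr", ArrayType(StructType([
        StructField("DelID", StringType()),
        StructField("SegID", StringType())])))]))

(df.withColumn("ID", F.explode("IDArray"))
   # look up the entry whose key equals ID, then parse it
   .withColumn("entry", F.from_json(
       F.expr("get_json_object(IDStruct, concat('$.', ID))"), entry_schema))
   .select("GroupId", "ID", F.explode("entry").alias("bag"))
   .select("GroupId", "ID", "bag.BagId", "bag.TravelerId",
           F.explode("bag.LegCorr").alias("leg"))
   .select("GroupId", "ID", "BagId", "TravelerId", "leg.DelID", "leg.SegID")
   .show(truncate=False))

This sidesteps hardcoding any of the IDStruct keys, at the cost of parsing JSON per row.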

It's convenient to use inline to expand arrays of structs. In the following code it's used 3 times.
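
As a toy illustration (not part of the answer's data), inline turns an array of structs into one row per struct, with one column per struct field:

spark.sql("SELECT inline(array(named_struct('a', 1, 'b', 2), named_struct('a', 3, 'b', 4)))").show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  2|
# |  3|  4|
# +---+---+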

This is your data:

from pyspark.sql import functions as F
import re
df = spark.createDataFrame(
    [({"GroupId":"10C81","IDArray":["ABC-XYZ-123"],"IDStruct":{"DSA-ASA-211":None,"BSA-ASA-211":None,"ABC-XYZ-123":[{"BagId":"42425fsdfs","TravelerId":"1234567","LegCorr":[{"DelID":"SQH","SegID":"PQR-UVW"},{"DelID":"GFS","SegID":"GHS-UVW"}]}]}},)],
    "MainTag struct<GroupId:string, IDArray:array<string>, IDStruct:struct<`ABC-XYZ-123`:array<struct<BagId:string,LegCorr:array<struct<DelID:string,SegID:string>>,TravelerId:string>>,`DSA-ASA-211`:string,`BSA-ASA-211`:string>>")
df.printSchema()
# root
#  |-- MainTag: struct (nullable = true)
#  |    |-- GroupId: string (nullable = true)
#  |    |-- IDArray: array (nullable = true)
#  |    |    |-- element: string (containsNull = true)
#  |    |-- IDStruct: struct (nullable = true)
#  |    |    |-- ABC-XYZ-123: array (nullable = true)
#  |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |-- BagId: string (nullable = true)
#  |    |    |    |    |-- LegCorr: array (nullable = true)
#  |    |    |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |    |    |-- DelID: string (nullable = true)
#  |    |    |    |    |    |    |-- SegID: string (nullable = true)
#  |    |    |    |    |-- TravelerId: string (nullable = true)
#  |    |    |-- DSA-ASA-211: string (nullable = true)
#  |    |    |-- BSA-ASA-211: string (nullable = true)

Script:

df = df.selectExpr("inline(array(MainTag))")  # unwrap the top-level struct into columns
df = df.selectExpr(
    *[c for c in df.columns if c != 'IDStruct'],
    "inline(IDStruct.`ABC-XYZ-123`)")  # expand the array of bag structs
df = df.selectExpr(
    *[c for c in df.columns if c != 'LegCorr'],
    "inline(LegCorr)")  # expand the array of leg structs
df = df.withColumn("IDArray", F.explode("IDArray"))  # one row per id in IDArray

df.show()
# +-------+-----------+----------+----------+-----+-------+
# |GroupId|    IDArray|     BagId|TravelerId|DelID|  SegID|
# +-------+-----------+----------+----------+-----+-------+
# |  10C81|ABC-XYZ-123|42425fsdfs|   1234567|  SQH|PQR-UVW|
# |  10C81|ABC-XYZ-123|42425fsdfs|   1234567|  GFS|GHS-UVW|
# +-------+-----------+----------+----------+-----+-------+

If you don't want to hard-code "ABC-XYZ-123", one possible way is to change the schema of the "MainTag" column. Something like the following could do it.

re.sub(r"([\w\-]+)(?=:array<struct<BagId)", "_ids", maintag_schema) - hopefully should rename that column to "_ids".
re.sub(r"(?<=,|<)([^,<]+)(?=:)", r"`\1`", maintag_schema) - hopefully should add ` around column/ field names which is needed for some field names.

maintag_schema = df.schema["MainTag"].dataType.simpleString()
maintag_schema = re.sub(r"([\w\-]+)(?=:array<struct<BagId)", "_ids", maintag_schema)
maintag_schema = re.sub(r"(?<=,|<)([^,<]+)(?=:)", r"`\1`", maintag_schema)

df = df.withColumn("MainTag", F.col("MainTag").cast(maintag_schema))

df = df.selectExpr("inline(array(MainTag))")
df = df.selectExpr(
    *[c for c in df.columns if c != 'IDStruct'],
    "inline(IDStruct._ids)")
df = df.selectExpr(
    *[c for c in df.columns if c != 'LegCorr'],
    "inline(LegCorr)")
df = df.withColumn("IDArray", F.explode("IDArray"))

df.show()
# +-------+-----------+----------+----------+-----+-------+
# |GroupId|    IDArray|     BagId|TravelerId|DelID|  SegID|
# +-------+-----------+----------+----------+-----+-------+
# |  10C81|ABC-XYZ-123|42425fsdfs|   1234567|  SQH|PQR-UVW|
# |  10C81|ABC-XYZ-123|42425fsdfs|   1234567|  GFS|GHS-UVW|
# +-------+-----------+----------+----------+-----+-------+
– ZygD
  • In the code below, can we somehow dynamically pass the hardcoded value 'ABC-XYZ-123', as it will vary for every record? df = df.selectExpr( *[c for c in df.columns if c != 'IDStruct'], "inline(IDStruct.`ABC-XYZ-123`)") – Vaibhav Oct 04 '22 at 10:28
  • Please look at the updated answer. – ZygD Oct 04 '22 at 13:13