
We have to build nested JSON in PySpark using the structure below, and I have added the data that needs to be fed through it.

Input data structure

(image of the column-to-path mapping; recreated in the code below)

Data

(image of the input rows; recreated in the code below)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Mapping of each source column to the JSON path it feeds
a1 = ["DA_STinf", "DA_Stinf_NA", "DA_Stinf_city", "DA_Stinf_NA_ID", "DA_Stinf_NA_ID_GRANT", "DA_country"]
a2 = ["data.studentinfo", "data.studentinfo.name", "data.studentinfo.city", "data.studentinfo.name.id", "data.studentinfo.name.id.grant", "data.country"]
columns = ["data", "action"]

df = spark.createDataFrame(zip(a1, a2), columns)

# Input data for the JSON structure
a1 = ["Pune"]
a2 = ["YES"]
a3 = ["India"]
cols = ["DA_Stinf_city", "DA_Stinf_NA_ID_GRANT", "DA_country"]
data = spark.createDataFrame(zip(a1, a2, a3), cols)

Expected result based on above data

{
    "data": {
        "studentinfo": {
            "city": "Pune",
            "name": {
                "id": {
                    "grant": "YES"
                }
            }
        },

        "country": "India"
    }
}

We have tried building this manually with the F.struct function, but we need a dynamic way to build this JSON from the df DataFrame, which holds the data and action columns:

from pyspark.sql import functions as F

data.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
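For reference, the nesting we want follows directly from the dotted action paths. Below is a plain-Python sketch (no Spark) of that path-to-nesting rule; the `mapping` and `row` literals and the `nest` helper are hand-written stand-ins for the `df` and `data` rows, not the Spark-native solution we are after:

```python
# Sketch: derive the nested dict from the dotted "action" paths.
# mapping/row are hard-coded stand-ins for the df and data DataFrames.
mapping = {
    "DA_Stinf_city": "data.studentinfo.city",
    "DA_Stinf_NA_ID_GRANT": "data.studentinfo.name.id.grant",
    "DA_country": "data.country",
}
row = {"DA_Stinf_city": "Pune", "DA_Stinf_NA_ID_GRANT": "YES", "DA_country": "India"}

def nest(mapping, row):
    result = {}
    for col_name, path in mapping.items():
        keys = path.split(".")
        node = result
        for key in keys[:-1]:
            node = node.setdefault(key, {})   # walk/create intermediate dicts
        node[keys[-1]] = row[col_name]        # set the leaf value
    return result

print(nest(mapping, row))
# {'data': {'studentinfo': {'city': 'Pune', 'name': {'id': {'grant': 'YES'}}}, 'country': 'India'}}
```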
Amol

1 Answer


The approach below should give the correct structure (with the wrong key names). It doesn't use DataFrame operations but instead works on the underlying RDD; if you are happy with the approach, I can flesh it out:

def build_json(pairs, running=None):
    # pairs: list of (hierarchy, value) tuples, where hierarchy is a
    # list of keys leading to the value
    if running is None:
        running = {}
    new_input = {}
    for hierarchy, value in pairs:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            # Leaf reached: store the value directly
            running[key] = value
        else:
            # Group the remaining hierarchy under the current key
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]

    # Recurse into each grouped sub-hierarchy
    for key in new_input:
        running[key] = build_json(new_input[key], running={})

    return running


data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)

The basic idea is to get, from the underlying RDD, a set of tuples consisting of the column name broken into its JSON hierarchy and the value to insert into that hierarchy. The build_json function then recursively inserts each value into its correct place while building out the JSON object.
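To see the recursion on plain Python data (no Spark needed), here is a minimal run; the sample pairs are illustrative, and the function is restated so the snippet stands alone:

```python
def build_json(pairs, running=None):
    # Recursively insert (hierarchy, value) pairs into a nested dict
    if running is None:
        running = {}
    new_input = {}
    for hierarchy, value in pairs:
        key = hierarchy.pop(0)
        if len(hierarchy) == 0:
            running[key] = value  # leaf: store the value
        else:
            # group remaining hierarchy under the current key
            new_input[key] = new_input.get(key, []) + [(hierarchy, value)]
    for key in new_input:
        running[key] = build_json(new_input[key], running={})
    return running

pairs = [
    (["data", "studentinfo", "city"], "Pune"),
    (["data", "country"], "India"),
]
print(build_json(pairs))
# {'data': {'country': 'India', 'studentinfo': {'city': 'Pune'}}}
```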

ags29