
I have a complex nested JSON file with struct types, array types, lists, and dicts nested within each other.

I have a function that flattens struct-type columns, but it fails when it encounters any other type.

Is there a recursive function that handles all these types properly and flattens a PySpark DataFrame down to leaf level?

The code I used to flatten struct types is:

from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Walk the schema with an explicit stack, projecting struct fields
    # until only leaf columns remain.
    stack = [((), nested_df)]
    columns = []
    while len(stack) > 0:
        parents, df = stack.pop()
        for column_name, column_type in df.dtypes:
            if column_type[:6] == "struct":
                projected_df = df.select(column_name + ".*")
                stack.append((parents + (column_name,), projected_df))
            else:
                columns.append(
                    col(".".join(parents + (column_name,)))
                    .alias("_".join(parents + (column_name,)))
                )
    return nested_df.select(columns)
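
For reference, here is a minimal sketch of one way the recursion could be extended to also walk arrays and maps, assuming Spark 2.2+ (for explode_outer) and schema inspection via StructType/ArrayType/MapType; the function name flatten_to_leaves is hypothetical, and this is just one possible approach, not a definitive implementation:

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, MapType, StructType

def flatten_to_leaves(df):
    # Repeatedly rewrite the first remaining complex column until the
    # schema contains only leaf (atomic) columns.
    while True:
        complex_fields = [
            (f.name, f.dataType)
            for f in df.schema.fields
            if isinstance(f.dataType, (StructType, ArrayType, MapType))
        ]
        if not complex_fields:
            return df
        name, dtype = complex_fields[0]
        if isinstance(dtype, StructType):
            # Promote each struct field to a top-level column;
            # an empty struct simply disappears.
            expanded = [
                col(name + "." + f.name).alias(name + "_" + f.name)
                for f in dtype.fields
            ]
            df = df.select("*", *expanded).drop(name)
        elif isinstance(dtype, ArrayType):
            # explode_outer keeps rows whose array is empty or null
            # (the element comes back as null); plain explode would
            # drop them.
            df = df.withColumn(name, explode_outer(name))
        else:
            # MapType: explode_outer a map into key/value columns.
            df = df.select(
                "*",
                explode_outer(name).alias(name + "_key", name + "_value"),
            ).drop(name)

Using explode_outer instead of explode is what covers the empty-value cases mentioned below: rows with empty or null arrays and maps survive with null leaves instead of being filtered out.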

I also need to handle empty structs, empty arrays, empty lists, and empty dicts, since the data may contain empty values.

How can I achieve this in PySpark?
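
For illustration, a hypothetical end-to-end run of the sketch above on a tiny record set (including an empty array) would look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize([
    '{"id": 1, "info": {"name": "a", "tags": ["x", "y"]}}',
    '{"id": 2, "info": {"name": "b", "tags": []}}',
]))
flatten_to_leaves(df).show()
# id=2 survives with info_tags = null, because explode_outer rather
# than explode is used for arrays.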
