0

I needed to flatten a simple Json file (json lines) and convert it into a Parquet format within a Spark Notebook in Azure Synapse Analytics. There is only one level of nested object for any column. However, I discovered that getting the schema of the dataframe did not return the schema of the nested object. I was using c# so that other company developers would not have to learn the other supported languages.

bmukes
  • 119
  • 2
  • 9
  • Have you got some sample JSON and expected results? I know you’ve self-answered but if the requirement is that simple then straightforward SQL could handle this, eg OPENJSON, JSON_VALUE etc – wBob Feb 13 '22 at 13:58
  • The JSON in question is a document where every line of code is a JSON document see (https://jsonlines.org/). Simply create a multi-line document and have one of the properties be an object that you want flattened. You can of course skip the flattening and query the children of the JSON object by using the JSON_VALUE function in TSQL see(https://learn.microsoft.com/en-us/sql/t-sql/functions/json-value-transact-sql?view=sql-server-ver15) but I prefer to have it flattened. – bmukes Feb 14 '22 at 15:08

1 Answers1

0

The code below will work for the situation indicated above. Hope it saves someone else a few hours. It also drops the parent column from the data frame after the properties of the child are added into the parent dataframe.

I had no need to make this code into a recursive transversal because we have no nested objects.

using System;
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using System.Diagnostics;


var df = spark.Read().Json("{Your source file path here}");

//get the schema of the data frame
var dfSchema = df.Schema() ;
// traverse the schema of the dataframe
foreach(var parentSchemaField in dfSchema.Fields) {
    
    if (parentSchemaField.DataType is StructType) {
        // get a new dataframe that just contains the child data from the parent
        var childFrame = df.Select($"{parentSchemaField.Name}.*") ;
        // traverse the schema of the child dataframe
        foreach(var childSchemaField in childFrame.Schema().Fields) {   
            //make a new column in the parent dataframe for each parents child property
            df = df.WithColumn($"{parentSchemaField.Name}.{childSchemaField.Name}",Col($"{parentSchemaField.Name}.{childSchemaField.Name}")) ;
        }
        // drop the parent column from the data frame its no longer needed
        df = df.Drop(parentSchemaField.Name) ;
    }
}
df.Write().Parquet("{Your sink file path here}") ;
bmukes
  • 119
  • 2
  • 9