I have a use case where I need to read nested JSON data and write it back as Parquet. (My schema changes based on the day I am reading the data, so I don't know the exact schema in advance.) Some of my nested keys contain characters like spaces, and when I try to save as Parquet I get an exception complaining about the special characters ,;{}()\n\t=.
Here is a sample schema. It's not the real schema; the keys are dynamic and change day by day:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val nestedSchema = StructType(Seq(
  StructField("event_time", StringType),
  StructField("event_id", StringType),
  StructField("app", StructType(Seq(
    StructField("environment", StringType),
    StructField("name", StringType),
    StructField("type", StructType(Seq(
      StructField("word tier", StringType), // this field causes the problem when saving as Parquet
      StructField("level", StringType)
    )))
  )))
))
val myDF = spark.createDataFrame(sc.emptyRDD[Row], nestedSchema)
myDF.printSchema
Output:
root
|-- event_time: string (nullable = true)
|-- event_id: string (nullable = true)
|-- app: struct (nullable = true)
| |-- environment: string (nullable = true)
| |-- name: string (nullable = true)
| |-- type: struct (nullable = true)
| | |-- word tier: string (nullable = true)
| | |-- level: string (nullable = true)
Trying to save it as Parquet:
myDF.write
.mode("overwrite")
.option("compression", "snappy")
.parquet("PATH/TO/DESTINATION")
I found a solution like this:
// replace runs of forbidden characters in the top-level column names
myDF.toDF(myDF
  .schema
  .fieldNames
  .map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)
  .write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("PATH/TO/DESTINATION")
But it only works on the top-level keys, not on nested ones. Is there a recursive solution for this?
My question is not a duplicate of this question, since my schema is dynamic and I don't know what my keys are; they change based on the data I am reading. So my solution needs to be generic: I need to somehow recursively recreate the same schema structure, but with corrected key names.
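To illustrate what I'm after, here is a rough sketch of the kind of recursive rewrite I have in mind (untested; sanitizeName and sanitizeSchema are placeholder names I made up, and I'm assuming it's safe to pair the cleaned schema with the original RDD because renaming fields doesn't change the positional row layout):

import org.apache.spark.sql.types._

// Replace every run of forbidden characters with an underscore.
def sanitizeName(name: String): String =
  "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")

// Recursively rebuild the schema, sanitizing field names at every level,
// including structs nested inside arrays and maps.
def sanitizeSchema(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map { f =>
      f.copy(name = sanitizeName(f.name), dataType = sanitizeSchema(f.dataType))
    })
  case at: ArrayType => at.copy(elementType = sanitizeSchema(at.elementType))
  case mt: MapType   => mt.copy(valueType = sanitizeSchema(mt.valueType))
  case other         => other
}

// Rows are positional, so the same RDD can be paired with the cleaned schema.
val cleanSchema = sanitizeSchema(myDF.schema).asInstanceOf[StructType]
val cleanDF = spark.createDataFrame(myDF.rdd, cleanSchema)

cleanDF.write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("PATH/TO/DESTINATION")

Is something along these lines the right approach, or is there a more idiomatic way to do this recursively?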