This is in a Microsoft Azure data lake running on Azure Databricks. I'm trying to read a JSON file that I did not create, which holds the schema (name and type information) for CSV files that I can read but that have no header row.
df1 = spark.read.json('/mnt/jsontest/...PATH.../SalesTable.cdm.json', multiLine=True)
df1.printSchema()
That loads a PySpark DataFrame and prints the schema below. What I want is a StructType schema I can use to read the CSV files. I haven't been able to get a DataFrame of the definitions array that I could loop over to create the struct fields (my best attempt is sketched after the schema printout).
root
|-- definitions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayName: string (nullable = true)
| | |-- entityName: string (nullable = true)
| | |-- hasAttributes: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dataFormat: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- purpose: string (nullable = true)
|-- imports: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- corpusPath: string (nullable = true)
|-- jsonSchemaSemanticVersion: string (nullable = true)
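Based on the docs, I think a double explode like this should flatten it down to one row per attribute, though I'm not certain it's the simplest route:

from pyspark.sql.functions import explode

# one row per entity definition, then one row per attribute
attrs_df = (
    df1
    .select(explode('definitions').alias('definition'))
    .select(explode('definition.hasAttributes').alias('attribute'))
    .select('attribute.name', 'attribute.dataFormat')
)
attrs_df.show(truncate=False)

Even if that works, I'd still have to collect the rows back to the driver and turn them into StructFields, which feels roundabout.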
If I could loop over hasAttributes, I think the dataFormat and name fields would be enough to build the schema structure. I also can't read the JSON directly as JSON; it always comes back as a DataFrame. If I try a normal Python 3 read, the file can't be found. What is the best, most normal way to traverse a JSON file to get to the fields I want?
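If I understand the Databricks docs correctly, DBFS mounts are also exposed on the driver's local filesystem under a /dbfs prefix, so something like this plain-Python read might be the way (I haven't confirmed the prefix; the path is elided the same way as above):

import json

# /mnt/... mounts should also be visible at /dbfs/mnt/... on the driver
with open('/dbfs/mnt/jsontest/...PATH.../SalesTable.cdm.json') as f:
    cdm = json.load(f)

for definition in cdm['definitions']:
    for attribute in definition['hasAttributes']:
        print(attribute['name'], attribute['dataFormat'])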
Most examples do something like this: new_schema = StructType.fromJson(first_json). But because the JSON I have doesn't contain a fields key, I get an error. hasAttributes sits where fields typically is and contains generally the same information.
Any help on how to create the struct, read this JSON as plain JSON, or reformat (explode/flatten) the DataFrame would be great. I'd think PySpark must have something simple for what is so simple in base Python 3.
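For concreteness, this is the shape of result I'm after. The CDM_TO_SPARK mapping below is my own guess at the CDM dataFormat names; I haven't verified them against the spec:

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, LongType, DoubleType,
                               BooleanType, TimestampType)

# assumed dataFormat -> Spark type mapping; the CDM names here are guesses
CDM_TO_SPARK = {
    'string': StringType(),
    'int32': IntegerType(),
    'int64': LongType(),
    'double': DoubleType(),
    'boolean': BooleanType(),
    'dateTime': TimestampType(),
}

def schema_from_attributes(has_attributes):
    # build a StructType from the list of hasAttributes dicts,
    # falling back to StringType for any unrecognized dataFormat
    return StructType([
        StructField(a['name'],
                    CDM_TO_SPARK.get(a['dataFormat'], StringType()),
                    True)
        for a in has_attributes
    ])

# once the JSON is loaded as a plain dict (see the json.load attempt above):
# csv_schema = schema_from_attributes(cdm['definitions'][0]['hasAttributes'])
# df = spark.read.csv('/mnt/...CSV_PATH...', schema=csv_schema, header=False)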