This is in a Microsoft Azure data lake, accessed from Azure Databricks. I'm trying to read a JSON file, which I did not create, that holds the schema (the name and type information) for CSVs that I can read but that have no header row.

df1 = spark.read.json('/mnt/jsontest/...PATH.../SalesTable.cdm.json', multiLine=True)
df1.printSchema() 

That loads a PySpark DataFrame and prints this schema:

root
 |-- definitions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayName: string (nullable = true)
 |    |    |-- entityName: string (nullable = true)
 |    |    |-- hasAttributes: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- dataFormat: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- purpose: string (nullable = true)
 |-- imports: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- corpusPath: string (nullable = true)
 |-- jsonSchemaSemanticVersion: string (nullable = true)

What I want is a StructType schema I can use to read the CSV files. I haven't been able to get a DataFrame of the definitions array that I could loop over to create the struct fields.

If I could loop over hasAttributes, I think I would have the dataFormat and name fields needed to build the schema. I also can't read the JSON directly as JSON; it always comes back as a DataFrame. And if I try a normal Python 3 read, the file can't be found. What is the most straightforward way to traverse this JSON to get to the fields I want?
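
As an aside, the plain Python read likely fails because of the path: on Databricks, DBFS mounts are also exposed on the driver's local filesystem under /dbfs. A minimal sketch, assuming that convention (the exact file path here is illustrative, not the real one from the question):

import json

# DBFS mount paths are readable with ordinary file APIs when prefixed
# with /dbfs; substitute the real mount path for this illustrative one.
with open('/dbfs/mnt/jsontest/SalesTable.cdm.json') as f:
    cdm = json.load(f)

# From here it is plain dict/list traversal.
attrs = cdm['definitions'][0]['hasAttributes']
print([(a['name'], a['dataFormat']) for a in attrs])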

Most examples do something like new_schema = StructType.fromJson(first_json), but because my JSON doesn't have a fields key, I get an error. hasAttributes is where fields typically is; it contains generally the same information.
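
For comparison, StructType.fromJson expects Spark's own schema layout, which is why the CDM file trips it up. A minimal sketch of the layout it does accept (the single field shown is taken from the sample data in the comments):

from pyspark.sql.types import StructType

# Spark's schema JSON uses a top-level "fields" list; the CDM file
# has "hasAttributes" instead, hence the error.
spark_style = {
    "type": "struct",
    "fields": [
        {"name": "SALESID", "type": "string", "nullable": True, "metadata": {}},
    ],
}
new_schema = StructType.fromJson(spark_style)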

Any help on how to create the struct, read this JSON as JSON, or reshape (explode/flatten) the DataFrame would be great. I'd think there must be something in PySpark that makes this as simple as it is in base Python 3.
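
For the explode/flatten route, a sketch along these lines (column names taken from the printed schema above) yields one row per attribute:

from pyspark.sql.functions import explode, col

# Explode the definitions array, then each definition's hasAttributes
# array, giving one row per CSV column with its name and data format.
attrs_df = (df1
    .select(explode('definitions').alias('d'))
    .select(col('d.entityName'), explode('d.hasAttributes').alias('a'))
    .select('entityName', 'a.name', 'a.dataFormat', 'a.purpose'))
attrs_df.show(truncate=False)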

  • `definitions` is an array; how would you know which definition is the CSV schema that you need? – pltc Oct 25 '21 at 19:32
  • Looking again, I'll update the question, because definitions is sort of the fields column: each definition has the displayName for the CSV column I want, and then hasAttributes has the type of data I want. – user1567438 Oct 25 '21 at 19:43
  • Can you paste a JSON sample here? Just one column is enough; it's hard to imagine otherwise. – pltc Oct 25 '21 at 19:56
  • This is from selecting `definitions.displayName` and then `flatten` on `hasAttributes`: |[SalesTable] |[{String, _SysRowId, hasA}, {String, LSN, hasA}, {DateTime, LastProcessedChange_DateTime, hasA}, {DateTime, DataLakeModified_DateTime, hasA}, {Int64, RECID, hasA}, {String, LINEDISC, hasA}, {String, SALESID, hasA}, {String, TDSGROUP_IN, hasA}, {Int64, ADDRESSREFRECID, hasA}, – user1567438 Oct 25 '21 at 19:57
  • What if you collect the attributes like this: `select('definitions.displayname', 'flatten_attribute').collect()`, then filter and get all the information you need? – pltc Oct 25 '21 at 20:10
  • `sdf1 = df1.select(df1.definitions.displayName, flatten(df1.definitions.hasAttributes)).collect()` then `print(sdf1)`. I sort of expected something like a DataFrame: DisplayName, DataFormat, name, purpose / SalesTable, String, _SysRowId, hasA / SalesTable, String, LSN, hasA / etc., so I could loop that and create the struct. Instead I get this output: [Row(definitions.displayName=['SalesTable'], flatten(definitions[hasAttributes])=[Row(dataFormat='String', name='_SysRowId', purpose='hasA'), Row(dataFormat='String', name='LSN', purpose='hasA'), – user1567438 Oct 27 '21 at 13:25
  • OK, thank you! The `.collect()` made it all work, more or less as I wanted. I hadn't been using that. It returned a list, and I was able to loop over the list. Thank you again. – user1567438 Oct 27 '21 at 13:32
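
Putting the comments together, here is a minimal sketch of the resolution. The dataFormat-to-Spark-type mapping is an assumption based on the formats visible in the comments above; anything unmapped falls back to StringType:

from pyspark.sql.functions import flatten
from pyspark.sql.types import (LongType, StringType, StructField,
                               StructType, TimestampType)

# Assumed mapping from CDM dataFormat names to Spark types; extend it
# for any formats not shown in the question.
TYPE_MAP = {'String': StringType(), 'Int64': LongType(),
            'DateTime': TimestampType()}

rows = df1.select(flatten(df1.definitions.hasAttributes).alias('attrs')).collect()
csv_schema = StructType([
    StructField(a.name, TYPE_MAP.get(a.dataFormat, StringType()), True)
    for a in rows[0].attrs
])

# csv_schema can then be used to read the headerless CSVs, e.g.:
# spark.read.csv('/mnt/...CSV_PATH.../*.csv', schema=csv_schema, header=False)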

0 Answers