I have the following json structure in a file that i want to read using pyspark
[{'id': '34556',
'InsuranceProvider': 'sdcsdf',
'Type': {'Client': {'PaidIn': {'Insuranceid': '442211',
'Insurancedesc': 'sdfsdf vdsfs',
'purchaseditems': [{'InsuranceNumber': '1',
'InsuranceLabel': 'SDF',
'Insurancequantity': 1,
'Insuranceprice': 234,
'discountsreceived': [{'amount': 120,
'description': 'Item 1, Discount 1'}],
'childItems': [{'InsuranceNumber': '1',
'InsuranceLabel': 'CSFGG',
'Insurancequantity': 1,
'Insuranceprice': 0,
'discountsreceived': [{'amount': 452,
'description': 'Insurance item 1, Discount 1'}]
}]}]}}}
'eventTime': '2022-05-19T01:59:10.379Z'
}]
I am using the following schema structure to read it:
discount_type = StructType([StructField("amount", IntegerType(), True),
StructField("description", StringType(), True)])
child_item_type = StructType([StructField("InsuranceNumber", StringType(), True),
StructField("InsuranceLabel", StringType(), True),
StructField("Insurancequantity", IntegerType(), True),
StructField("Insuranceprice", IntegerType(), True),
StructField("discountsreceived", ArrayType(discount_type) , True),
])
item_type = StructType([StructField("InsuranceNumber", StringType(), True),
StructField("InsuranceLabel", StringType(), True),
StructField("Insurancequantity", IntegerType(), True),
StructField("Insuranceprice", IntegerType(), True),
StructField("discountsreceived", ArrayType(discount_type), True),
StructField("childItems",ArrayType(child_item_type) , True),
])
order_paid_type = StructType([StructField("Insuranceid", StringType(), True),
StructField("Insurancedesc", StringType(), True),
StructField("purchaseditems", ArrayType(item_type), True),
])
message_type = StructType([StructField("PaidIn", order_paid_type, True)])
data_type = StructType([StructField("Client", message_type, True)])
body_type = StructType([StructField("id", StringType(), True),
StructField("InsuranceProvider", StringType(), True),
StructField("Type", data_type, True),
StructField("eventTime", StringType(), True),
])
Can someone please point out what i am doing wrong here, because while reading the above json from a file using the following code and the above defined schema, all I get is a data frame full of nulls:
data = spark.read.schema(schema).json(file_path)
data.show()
Output:
+----+-----------------+----+---------+
| id|InsuranceProvider|Type|eventTime|
+----+-----------------+----+---------+
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
|null| null|null| null|
+----+-----------------+----+---------+