Reading JSON using Pyspark returns data frame full of nulls

Question

I have the following json structure in a file that i want to read using pyspark

[{'id': '34556',
  'InsuranceProvider': 'sdcsdf',
  'Type': {'Client': {'PaidIn': {'Insuranceid': '442211',
     'Insurancedesc': 'sdfsdf vdsfs',
     'purchaseditems': [{'InsuranceNumber': '1',
       'InsuranceLabel': 'SDF',
       'Insurancequantity': 1,
       'Insuranceprice': 234,
       'discountsreceived': [{'amount': 120,
         'description': 'Item 1, Discount 1'}],
       'childItems': [{'InsuranceNumber': '1',
         'InsuranceLabel': 'CSFGG',
         'Insurancequantity': 1,
         'Insuranceprice': 0,
         'discountsreceived': [{'amount': 452,
           'description': 'Insurance item 1, Discount 1'}]
}]}]}}}
 'eventTime': '2022-05-19T01:59:10.379Z'
 }]

I am using the following schema structure to read it:


    discount_type = StructType([StructField("amount", IntegerType(), True),
                    StructField("description", StringType(), True)])

    child_item_type = StructType([StructField("InsuranceNumber", StringType(), True),
                                      StructField("InsuranceLabel", StringType(), True),
                                      StructField("Insurancequantity", IntegerType(), True),
                                      StructField("Insuranceprice", IntegerType(), True),
                                      StructField("discountsreceived", ArrayType(discount_type) , True),
                                      ])

    item_type = StructType([StructField("InsuranceNumber", StringType(), True),
                                StructField("InsuranceLabel", StringType(), True),
                                StructField("Insurancequantity", IntegerType(), True),
                                StructField("Insuranceprice", IntegerType(), True),
                                StructField("discountsreceived", ArrayType(discount_type), True),
                                StructField("childItems",ArrayType(child_item_type) , True),
                                ])

    order_paid_type = StructType([StructField("Insuranceid", StringType(), True),
                                      StructField("Insurancedesc", StringType(), True),
                                      StructField("purchaseditems", ArrayType(item_type), True),
                                      ])

    message_type = StructType([StructField("PaidIn", order_paid_type, True)])
    data_type = StructType([StructField("Client", message_type, True)])

    body_type = StructType([StructField("id", StringType(), True),
                            StructField("InsuranceProvider", StringType(), True),
                            StructField("Type", data_type, True),
                            StructField("eventTime", StringType(), True),
                            ])

Can someone please point out what i am doing wrong here, because while reading the above json from a file using the following code and the above defined schema, all I get is a data frame full of nulls:


    data = spark.read.schema(schema).json(file_path)
    data.show()

Output:

+----+-----------------+----+---------+
|  id|InsuranceProvider|Type|eventTime|
+----+-----------------+----+---------+
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
|null|             null|null|     null|
+----+-----------------+----+---------+

If your JSON include new lines, you need `multiLine=True` in json function. — Emma, Jul 11 '22 at 20:53

score 1 · Answer 1 · answered Jul 12 '22 at 04:37

Aman you have an issue with the json data you have shared, it missing a , before eventTime field. Also incorporate Emma's suggestion on multiline. See if this helps..

Input:sample.json

[
  {'id': '34556',
  'InsuranceProvider': 'sdcsdf',
  'Type': {'Client': {'PaidIn': {'Insuranceid': '442211',
     'Insurancedesc': 'sdfsdf vdsfs',
     'purchaseditems': [
            {'InsuranceNumber': '1',
       'InsuranceLabel': 'SDF',
       'Insurancequantity': 1,
       'Insuranceprice': 234,
       'discountsreceived': [
                {'amount': 120,
         'description': 'Item 1, Discount 1'
                }
              ],
       'childItems': [
                {'InsuranceNumber': '1',
         'InsuranceLabel': 'CSFGG',
         'Insurancequantity': 1,
         'Insuranceprice': 0,
         'discountsreceived': [
                    {'amount': 452,
           'description': 'Insurance item 1, Discount 1'
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    },
 'eventTime': '2022-05-19T01: 59: 10.379Z'
  }
]

>>> df=spark.read.option("multiline","true").schema(body_type).json('sample.json')
>>> df.show()
+-----+-----------------+--------------------+--------------------+
|   id|InsuranceProvider|                Type|           eventTime|
+-----+-----------------+--------------------+--------------------+
|34556|           sdcsdf|{{{442211, sdfsdf...|2022-05-19T01: 59...|
+-----+-----------------+--------------------+--------------------+

guess the only thing i was missing the option("multiline","true"). Thanks and sorry for the trouble teedak8s and @Emma — Aman Mishra, Jul 12 '22 at 07:33
@teedaks although your above suggestion worked for me in reading the json file, but the resulted dataframe still had the `Type` column as a StructType and not an array type. I get the following output for Type field when i run df.schema.fields : StructField(Type,StructType(List(StructField(Client,StructType(List(StructField(PaidIn,StructType(List(StructField(Insuranceid,StringType,true),StructField(Insurancedesc,StringType,true),StructField(purchaseditems,ArrayType(StructType(List(StructField(InsuranceNumber,StringType,true),StructField(InsuranceLabel,StringType,true)...... etc — Aman Mishra, Jul 12 '22 at 19:17
Sorry but this would be separate question then, and moreover whatever you have defined in `body_type` would show up. — teedak8s, Jul 12 '22 at 21:46

Reading JSON using Pyspark returns data frame full of nulls

1 Answers1