I am reading JSON from paginated API responses and appending each page to a list in a loop. When I try to read the list into Spark, it throws a _corrupt_record error. I have gone through a couple of posts about this, but none of the solutions worked. Any suggestions on how to fix this?
import requests

# spark is an existing SparkSession
total_results = []
# inside a loop over pages:
response = requests.get(getURL, headers=headers)
data = response.json()   # each page parses to a list of dicts
total_results.append(data)
.......
.......
rdd = spark.sparkContext.parallelize(total_results)
print(rdd)
df = spark.read.option('multiline', 'true').json(rdd)
df.show()
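For reference, as far as I can tell spark.read.json can also accept an RDD of JSON strings, one document per element. A minimal sketch of that string-based form (the record below is made up just to show the shape; spark is an existing SparkSession):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each RDD element must be a JSON string, not a Python list/dict;
# the sample page here is invented for illustration.
sample_page = [{'abc_pqq': '11-111-1111', 'ID': 82346790000}]
json_rdd = spark.sparkContext.parallelize([json.dumps(sample_page)])
df = spark.read.json(json_rdd)
df.show()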
Output and error below:
ParallelCollectionRDD[189] at parallelize at PythonRDD.scala:195
+--------------------+
| _corrupt_record|
+--------------------+
|[{'API_UWI': '42-...|
|[{'API_UWI': '33-...|
+--------------------+
Sample data from the output:
{"_corrupt_record":"[{'abc_pqq': '12-00-45672', 'abc_pqq_12': '12-00-
45672-00', 'Unformatted': '0421733644800',............
{"_corrupt_record":"[{'abc_pqq': '13-10-45672', 'abc_pqq_12': '322-173-
36499-00', 'Unformatted': '222223644800',..........
{"_corrupt_record":"[{'abc_pqq': '22-223-45678', 'abc_pqq_12': '22-111-
9876543', 'Unformatted': '567890000',...................
{"_corrupt_record":"[{'abc_pqq': '33-22-678900', 'abc_pqq_12': '99-88-
7654321', 'Unformatted': '111111111',...............
....................
................
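The single quotes inside the corrupt records look like Python's str() of each list rather than actual JSON, so I suspect Spark is stringifying the Python objects before trying to parse them. A sketch of what I think the fix would be, serializing each page with json.dumps before parallelizing (assuming total_results is built as above):

import json

# Serialize each page (a list of dicts) to a proper JSON string;
# str() on a Python list produces single quotes, which is not valid JSON.
json_pages = [json.dumps(page) for page in total_results]
rdd = spark.sparkContext.parallelize(json_pages)
df = spark.read.json(rdd)   # each element parses as a JSON array of objects
df.show()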
Sample output from the total_results list:
[[{'abc_pqq': '11-111-1111', 'abc_pqq_12': '11-111-1111-1111', 'Unformatted': '421733878600', 'abc_pqq_14': '22-222-222-22222', .............................................'ID': 82346790000},
{'abc_pqq': '11-222-2222', 'abc_pqq_12': '22-222-222-22222', 'Unformatted': '420230106900', 'abc_pqq_14': '44-444-444-444444', '..............................................'
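Since total_results is a list of lists of dicts, I also wondered whether flattening it and calling createDataFrame directly would sidestep the JSON parsing entirely (sketch; assumes the dict values are flat scalars):

from itertools import chain

# Flatten the pages into one list of dicts and let Spark infer the schema
# from the Python objects directly, skipping the JSON reader.
flat_records = list(chain.from_iterable(total_results))
df = spark.createDataFrame(flat_records)
df.show()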