I am reading JSON from paginated API responses and appending each page to a list in a loop. When I try to read the list into Spark, it throws a _corrupt_record error. I have gone through a couple of posts about this, but none of the solutions worked. Any suggestions on how to fix this?
import requests

# spark is an existing SparkSession
total_results = []
# inside a loop over pages:
response = requests.get(getURL, headers=headers)
data = response.json()   # each page parses to a list of dicts
total_results.append(data)
.......
.......
rdd = spark.sparkContext.parallelize(total_results)
print(rdd)
df = spark.read.option('multiline', 'true').json(rdd)
df.show()
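For reference, as far as I can tell spark.read.json can also accept an RDD of JSON strings, one document per element. A minimal sketch of that string-based form (the record below is made up just to show the shape; spark is an existing SparkSession):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each RDD element must be a JSON string, not a Python list/dict;
# the sample page here is invented for illustration.
sample_page = [{'abc_pqq': '11-111-1111', 'ID': 82346790000}]
json_rdd = spark.sparkContext.parallelize([json.dumps(sample_page)])
df = spark.read.json(json_rdd)
df.show()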
Output and error below:
ParallelCollectionRDD[189] at parallelize at PythonRDD.scala:195
+--------------------+
| _corrupt_record|
+--------------------+
|[{'API_UWI': '42-...|
|[{'API_UWI': '33-...|
+--------------------+
Sample data from the output:
{"_corrupt_record":"[{'abc_pqq': '12-00-45672', 'abc_pqq_12': '12-00-
45672-00', 'Unformatted': '0421733644800',............
{"_corrupt_record":"[{'abc_pqq': '13-10-45672', 'abc_pqq_12': '322-173-
36499-00', 'Unformatted': '222223644800',..........
{"_corrupt_record":"[{'abc_pqq': '22-223-45678', 'abc_pqq_12': '22-111-
9876543', 'Unformatted': '567890000',...................
{"_corrupt_record":"[{'abc_pqq': '33-22-678900', 'abc_pqq_12': '99-88-
7654321', 'Unformatted': '111111111',...............
....................
................
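The single quotes inside the corrupt records look like Python's str() of each list rather than actual JSON, so I suspect Spark is stringifying the Python objects before trying to parse them. A sketch of what I think the fix would be, serializing each page with json.dumps before parallelizing (assuming total_results is built as above):

import json

# Serialize each page (a list of dicts) to a proper JSON string;
# str() on a Python list produces single quotes, which is not valid JSON.
json_pages = [json.dumps(page) for page in total_results]
rdd = spark.sparkContext.parallelize(json_pages)
df = spark.read.json(rdd)   # each element parses as a JSON array of objects
df.show()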
Sample output from the total_results list:
[[{'abc_pqq': '11-111-1111', 'abc_pqq_12': '11-111-1111-1111', 'Unformatted': '421733878600', 'abc_pqq_14': '22-222-222-22222', .............................................'ID': 82346790000},
{'abc_pqq': '11-222-2222', 'abc_pqq_12': '22-222-222-22222', 'Unformatted': '420230106900', 'abc_pqq_14': '44-444-444-444444', '..............................................'
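Since total_results is a list of lists of dicts, I also wondered whether flattening it and calling createDataFrame directly would sidestep the JSON parsing entirely (sketch; assumes the dict values are flat scalars):

from itertools import chain

# Flatten the pages into one list of dicts and let Spark infer the schema
# from the Python objects directly, skipping the JSON reader.
flat_records = list(chain.from_iterable(total_results))
df = spark.createDataFrame(flat_records)
df.show()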