I am trying to convert XML to JSON in my DataFrame. I have the following function:
import json
import xmltodict
def xmlparse(line):
    return json.dumps(xmltodict.parse(line))
The column 'XML_Data' in my DataFrame has XML in it.
testing = t.select('XML_Data').rdd.map(xmlparse)
testing.take(1)
fails with:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 338, wn0-uticas.ffrd5tvlixoubfzdt0g523uj1f.cx.internal.cloudapp.net, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
    process()
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 1338, in takeUpToNumLeft
    yield next(iterator)
  File "<stdin>", line 2, in xmlparse
  File "/usr/bin/anaconda/envs/py35/lib/python3.5/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
TypeError: a bytes-like object is required, not 'Row'
Assuming the error is in my xmlparse function, how do I properly map over the Row objects so that the parser receives bytes or a string?
Schema of t:
root
 |-- TransactionMembership: string (nullable = true)
 |-- XML_Data: string (nullable = true)
The DataFrame has 60k rows in total.
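
For reference, the traceback suggests that each element of t.select('XML_Data').rdd is a Row wrapping the string rather than the string itself, so the field probably has to be pulled out of the Row before it is parsed. Below is a minimal sketch of that idea which runs without Spark or xmltodict. FakeRow and simple_parse are hypothetical stand-ins introduced only for illustration (for pyspark.sql.Row and xmltodict.parse respectively), and the sample XML is made up.

```python
import json
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
# Like a real Row, it exposes its fields as attributes.
FakeRow = namedtuple("FakeRow", ["XML_Data"])


def simple_parse(xml_string):
    """Tiny stand-in for xmltodict.parse (assumption: flat, text-only children)."""
    root = ET.fromstring(xml_string)
    return {root.tag: {child.tag: child.text for child in root}}


def xmlparse(row, parse=simple_parse):
    # Extract the string field from the row first; handing the whole row
    # to the parser is what produces the "not 'Row'" TypeError.
    return json.dumps(parse(row.XML_Data))


# Made-up sample data, standing in for the XML_Data column.
rows = [FakeRow("<txn><id>1</id><amount>9.50</amount></txn>")]
converted = [xmlparse(r) for r in rows]
print(converted[0])
```

On the real DataFrame, the equivalent would presumably be to keep the original xmlparse and change only the map call, for example t.select('XML_Data').rdd.map(lambda row: xmlparse(row.XML_Data)). Since the schema marks XML_Data as nullable, rows where it is None would likely still need a guard before parsing.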