TypeError: a bytes-like object is required, not 'Row' Spark RDD Map

Question

I am trying to convert XML to JSON in my DataFrame. I have the following function:

import json
import xmltodict

def xmlparse(line):
    return json.dumps(xmltodict.parse(line))

The column 'XML_Data' in my DataFrame has XML in it.

testing = t.select('XML_Data').rdd.map(xmlparse)

Calling testing.take(1) returns the following error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 338, wn0-uticas.ffrd5tvlixoubfzdt0g523uj1f.cx.internal.cloudapp.net, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
    process()
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 1338, in takeUpToNumLeft
    yield next(iterator)
  File "<stdin>", line 2, in xmlparse
  File "/usr/bin/anaconda/envs/py35/lib/python3.5/site-packages/xmltodict.py", line 330, in parse
    parser.Parse(xml_input, True)
TypeError: a bytes-like object is required, not 'Row'

Assuming the error is in my xmlparse function, how do I properly map over the Row objects so that xmlparse receives bytes or a string?

Schema of t

root
 |-- TransactionMembership: string (nullable = true)
 |-- XML_Data: string (nullable = true)

The DataFrame has 60k rows in total.

Can you [edit](https://stackoverflow.com/posts/49265274/edit) this post and add the output of `t.printSchema()`? Also, it would be helpful if you could provide an [mcve]. Read more on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). — pault, Mar 13 '18 at 20:28
Try: `testing = t.select('XML_Data').rdd.map(lambda row: xmlparse(row['XML_Data']))` — pault, Mar 13 '18 at 20:32

score 0 · Accepted Answer · answered Mar 18 '18 at 00:48


testing = t.select('XML_Data').rdd.map(lambda row: xmlparse(row['XML_Data']))
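
Why this works: `t.select('XML_Data').rdd` is an RDD of Row objects, so the original `map(xmlparse)` handed a whole Row to the parser instead of the string inside it. The sketch below illustrates that with stand-ins only, so it runs without Spark: the `Row` class here is a hypothetical minimal substitute for `pyspark.sql.Row`, the built-in `map` stands in for `rdd.map`, and `xmlparse` uses the standard library's expat parser (the same `parser.Parse` call that appears in the traceback) rather than xmltodict. It is not the real Spark or xmltodict code, only an approximation of the behaviour.

```python
import json
import xml.parsers.expat


class Row(tuple):
    """Hypothetical stand-in for pyspark.sql.Row: a tuple with named fields."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)
        return row

    def __getitem__(self, key):
        # Allow lookup by column name, as the real Row does.
        if isinstance(key, str):
            return super().__getitem__(self._names.index(key))
        return super().__getitem__(key)


def xmlparse(line):
    """Simplified stand-in for json.dumps(xmltodict.parse(line)).

    It only records the root tag, but it feeds the input to expat
    in the same way, so it rejects non-string input the same way.
    """
    tags = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: tags.append(name)
    parser.Parse(line, True)
    return json.dumps({"root": tags[0]})


# One-row stand-in for t.select('XML_Data').rdd
rows = [Row(XML_Data="<a><b>1</b></a>")]

# Passing the Row itself fails, because expat needs str or bytes.
error = None
try:
    list(map(xmlparse, rows))
except TypeError as exc:
    error = str(exc)

# Extracting the column value first gives the parser a plain string.
fixed = list(map(lambda row: xmlparse(row["XML_Data"]), rows))
```

With the real DataFrame, `row['XML_Data']`, `row.XML_Data`, and `row[0]` should all pull out the same string, since the selection contains a single column.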


mdeonte001
