I have an XML message coming in from a Kafka topic, and I want to convert the incoming message into a DataFrame so I can use it further downstream. I have achieved this for JSON messages but not for XML. Any help would be appreciated.
Code with JSON (working fine):
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("key", StringType()),
    StructField("value", StringType())
])

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic-name") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
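For context, once parsed_value is available I just expand the struct into plain columns, roughly like this (a minimal sketch assuming the two-field schema above; json_df is just a placeholder name):

# hypothetical downstream step: flatten the parsed struct into columns
json_df = df.select("parsed_value.key", "parsed_value.value")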
For XML, I tried using the parse function from the xmltodict module to build a UDF and apply it to the relevant column of the DataFrame. It works as expected.
import json
import xmltodict

from pyspark.sql import functions as sf
from pyspark.sql.types import StringType

def parse_xml(xml_string):
    # Convert the XML payload to a dict, then serialise it as a JSON string
    xml_dict = xmltodict.parse(xml_string)
    return json.dumps(xml_dict)

xml_to_json_udf = sf.udf(parse_xml, StringType())
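As a quick sanity check (a made-up XML string, just to show what the function returns), calling the plain function directly gives the JSON I expect:

sample = "<order><id>1</id><status>new</status></order>"  # made-up example payload
print(parse_xml(sample))  # {"order": {"id": "1", "status": "new"}}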
I later use it on the DataFrame like this:
df = df.withColumn('xml_column_as_xml', xml_to_json_udf(df._kafka_value))  # _kafka_value is a string column
However, the problem I am facing is that when the message comes from a Kafka producer, the string value does not get parsed as XML (I am still trying to figure out why).
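For completeness, this is roughly the end-to-end version I am experimenting with (a sketch only; the schema, topic name, and column names are placeholders for my real ones). My current assumption is that the Kafka value arrives as binary and has to be cast to string before the UDF and from_json see it:

import json
import xmltodict  # has to be available on the executors as well

from pyspark.sql import functions as sf
from pyspark.sql.types import StructType, StructField, StringType

def parse_xml(xml_string):
    # Return None for empty/missing payloads instead of failing the micro-batch
    if not xml_string:
        return None
    return json.dumps(xmltodict.parse(xml_string))

xml_to_json_udf = sf.udf(parse_xml, StringType())

# Placeholder schema: shaped after the JSON that xmltodict produces for my XML
schema = StructType([
    StructField("root", StructType([
        StructField("key", StringType()),
        StructField("value", StringType())
    ]))
])

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic-name")
      .load()
      # Kafka hands the value over as binary, so cast it to string first
      .withColumn("_kafka_value", sf.col("value").cast("string"))
      .withColumn("xml_as_json", xml_to_json_udf(sf.col("_kafka_value")))
      .withColumn("parsed_value", sf.from_json(sf.col("xml_as_json"), schema)))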